Comments (7)

Tasilee commented on August 24, 2024

My concern with any alternate term is that users already have a concept attached to the term "duplicate". I agree that the way you think about it depends on usage. In most statistical applications (such as SDM), you do not want occurrence records weighted by 'duplicates'.

Again, the key issue here is that the test FLAGS a POTENTIAL duplicate and it is then up to the user to decide what, if anything, needs to be done, depending on their purpose (fitness for use).

Our wording of the description seems fitting:

"The record appears to be a duplicate based on a combination of taxon name and/or collector and/or collector number and/or locality and/or date/time".

I am therefore happy to leave this as is.
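Purely as an illustration of the kind of flagging described above (not the normative bdq test specification), here is a minimal Python sketch that groups records on a combination of those fields and flags every record whose key occurs more than once. The Darwin Core-style field names and the simple normalisation are assumptions of the example.

```python
# Illustrative sketch only: flag POTENTIAL duplicates by grouping records on a
# combination of taxon name, collector, collector number, locality and date.
# Field names follow Darwin Core conventions but are assumptions of this example,
# not the normative bdq test specification.
from collections import defaultdict

KEY_FIELDS = ("scientificName", "recordedBy", "recordNumber", "locality", "eventDate")

def _normalise(value):
    """Case-fold and strip whitespace so trivial formatting differences do not hide matches."""
    return (value or "").strip().lower()

def flag_potential_duplicates(records):
    """Return the set of record ids that share a key with at least one other record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(_normalise(rec.get(f)) for f in KEY_FIELDS)
        groups[key].append(rec["id"])
    flagged = set()
    for ids in groups.values():
        if len(ids) > 1:          # more than one record with the same key
            flagged.update(ids)   # flag them all; the user decides what to do
    return flagged

# Example usage
records = [
    {"id": 1, "scientificName": "Acacia dealbata", "recordedBy": "L. Belbin",
     "recordNumber": "123", "locality": "Hobart", "eventDate": "2020-01-15"},
    {"id": 2, "scientificName": "acacia dealbata", "recordedBy": "L. Belbin",
     "recordNumber": "123", "locality": "Hobart", "eventDate": "2020-01-15"},
    {"id": 3, "scientificName": "Acacia dealbata", "recordedBy": "A. Chapman",
     "recordNumber": "77", "locality": "Ballan", "eventDate": "2019-05-02"},
]
print(flag_potential_duplicates(records))  # {1, 2}
```

Note that this only flags potential duplicates; deciding what, if anything, to do with them remains a user decision, as described above.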

stanblum commented on August 24, 2024

Context could/should sort this out. In the first case (records that duplicate a set of values for relevant fields), the "status" of being a duplicate can't be a property of the record independently of its context; it's a property of the record (and at least one other) within a dataset, i.e., a subset. For that context (a particular dataset), then, would "duplicate_records" work? Is this necessary? Or is the converse test more appropriate: all records in this dataset are unique -- tested and passed?

For the duplicate specimen case, "duplicate_specimen of X" is something that could stay with the duplicate. But some additional details might be needed to nail down how far duplicate extends -- same "biological individual", same collection date?
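As a sketch of the "converse test" suggested above (all records in this dataset are unique on the chosen fields, tested and passed), assuming the same illustrative key fields as in the earlier sketch; the function name is hypothetical.

```python
# Illustrative sketch of the converse test: PASS when no two records in the dataset
# share the same combination of key fields. Field names are assumptions of the example.
def all_records_unique(records, key_fields=("scientificName", "recordedBy",
                                            "recordNumber", "locality", "eventDate")):
    keys = [tuple((rec.get(f) or "").strip().lower() for f in key_fields)
            for rec in records]
    return len(keys) == len(set(keys))  # True -> "tested and passed" for this dataset
```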

ArthurChapman commented on August 24, 2024

Thanks Lee and Stan. You have convinced me that "duplicate" will work in context. Lee has explained that the description of the test in the Tests and Assertions places the test in context. I will check the Vocabulary we are developing and make sure our definition takes that context into account.

I will leave the issue open for a while to see if anyone else wants to comment.

ArthurChapman commented on August 24, 2024

allankv commented on August 24, 2024

Indeed, to determine that a record is duplicated (or unique, the DQ Dimension in the conceptual framework context), we have to consider the dataset context, as mentioned by Stan, but it also depends on the Use Case context, as mentioned by Lee. For example, in a given context (A), users may want to identify whether there is another record in the dataset that has the same species name and occurs on the exact same date and at the exact same coordinates. But in another context (B), users may want to identify the records in the dataset that have the same species name and occur in the same year and in the same area/cell of 10 km x 10 km (just naive examples, probably not realistic).

One way to measure it, using the framework, would be 1/(D+1), where D is the number of records "duplicated/similar" according to the (A) or (B) definition. E.g.: if a record has another 3 similar records in the dataset, the "uniqueness" measure (specifically in this Use Case context) would equal 1/(3+1) = 0.25. But if, in the same context, a "uniqueness" of 1 were required for the record to be considered fit for use, for example, then this record would not be fit for use in the context of that dataset and that Use Case.

This example concerned a DQ measure assigned to individual records, but based on those measures we could generate a DQ measure assigned to the dataset/multi-record level. E.g.: a dataset has a uniqueness of 0.87, obtained as the average of the "uniqueness" measure of each record in the dataset. So, if all the records have a uniqueness of 1, the dataset would have a uniqueness of 1 too; otherwise, DQ enhancement actions would be necessary to increase this measure, e.g. removing/filtering records that do not have a "uniqueness" of 1.
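A minimal sketch of the two measures described above, assuming a caller-supplied is_similar predicate that stands in for a Use Case-specific (A) or (B) definition of "duplicated/similar"; the function and parameter names are hypothetical, not part of the framework itself.

```python
# Illustrative sketch of the per-record "uniqueness" measure 1/(D+1) and the
# dataset-level average built from it. `is_similar` stands in for a Use Case-specific
# (A) or (B) definition of "duplicated/similar"; all names here are hypothetical.
def record_uniqueness(record, dataset, is_similar):
    """1/(D+1), where D counts the OTHER records judged similar to `record`."""
    d = sum(1 for other in dataset if other is not record and is_similar(record, other))
    return 1.0 / (d + 1)

def dataset_uniqueness(dataset, is_similar):
    """Dataset-level measure: the mean of the per-record uniqueness values."""
    if not dataset:
        return 1.0
    return sum(record_uniqueness(r, dataset, is_similar) for r in dataset) / len(dataset)

# A record with 3 similar records in the dataset gets 1/(3+1) = 0.25, as in the
# example above; filtering out records with uniqueness < 1 would raise the
# dataset-level measure toward 1.
```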

qgroom commented on August 24, 2024

ArthurChapman commented on August 24, 2024

At this stage it seems to be OK to use "duplicate"; however, we must be careful to make sure that the definition mentions context. I will now close this discussion. Thank you all for participating.
