Comments (6)
Personally I don't think that's a good idea as it forces a dependency on users who might not otherwise want to do their wrangling using {janitor}. The pages should be as agnostic in that regard as possible.
from pangaear.
A way to address the issue of repeated column names for long variable names, without requiring external dependencies, is to change the behaviour of the internal function pangaear:::pang_GET(), which is called by pg_data(). This is where the tibble is created from the downloaded data. Currently it calls:

tibble::as_tibble(dat, .name_repair = "minimal")

A better default behaviour would be:

tibble::as_tibble(dat, .name_repair = "check_unique")
# OR
tibble::as_tibble(dat, .name_repair = "unique")
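For illustration, here is a small base-R sketch of what the "unique" repair does with duplicated names (the column names below are invented; tibble itself appends a "...position" suffix to every duplicated name, which this approximates without requiring tibble):

```r
# Invented example names: two columns share the same truncated name,
# as can happen when long PANGAEA variable names lose their units.
nms <- c("Temp [degC]", "Temp [degC]", "Depth [m]")

# Rough base-R approximation of tibble's .name_repair = "unique":
# duplicated names get a "...<position>" suffix, unique names are kept.
dup <- duplicated(nms) | duplicated(nms, fromLast = TRUE)
repaired <- ifelse(dup, paste0(nms, "...", seq_along(nms)), nms)
print(repaired)
# "Temp [degC]...1" "Temp [degC]...2" "Depth [m]"
```

With "check_unique", tibble would instead raise an error on the duplicates, forcing the caller to deal with them explicitly.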
To ensure backward compatibility, the package maintainer could add the argument .name_repair = "minimal" to pg_data(). Users could then override this argument, since it would be passed from pg_data() to pang_GET() to as_tibble().
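A sketch of that argument threading (the function bodies here are stand-ins, not the real pangaear source; only the passing of .name_repair is the point, and make.unique() merely imitates the repair step):

```r
# Hypothetical stand-in for pangaear:::pang_GET(); in the real package
# this is where tibble::as_tibble(dat, .name_repair = .name_repair)
# would be called. make.unique() only imitates the repair here.
pang_GET <- function(dat, .name_repair = "minimal") {
  if (.name_repair != "minimal") {
    names(dat) <- make.unique(names(dat), sep = "...")
  }
  dat
}

# pg_data() exposes the argument and forwards it unchanged, so the
# default behaviour ("minimal") is identical to the current one.
pg_data <- function(dat, .name_repair = "minimal") {
  pang_GET(dat, .name_repair = .name_repair)
}

d <- list(Temp = 1:2, Temp = 3:4)
names(pg_data(d))                           # unchanged: "Temp" "Temp"
names(pg_data(d, .name_repair = "unique"))  # "Temp" "Temp...1"
```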
Currently, because tibble::as_tibble(dat, .name_repair = "minimal") limits the character length of column names, long variable names tend to have their units of measurement cut off. That comes on top of the issue of several columns sharing the same name.
@naupaka @gavinsimpson took over maintaining this pkg, i'll hand it over to them
Hi @japhir -- thanks for the note. I would be a bit worried about changing the column names in any sort of automated fashion, since it could break a lot of old code from users who rely on the old names. I think if a user wants to run janitor::clean_names() then they can do so, since they would then be aware of the consequences.
An alternative would be some sort of warning when there are column-name issues, perhaps with a suggestion/message to the user to consider using janitor::clean_names(). Would that be a reasonable alternative from your perspective?
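A minimal sketch of such a check (the function name and the heuristics are invented for illustration, not pangaear code):

```r
# Hypothetical helper: warn, but do not rename, when column names are
# duplicated or contain characters that would need backticks.
check_colnames <- function(dat) {
  nms <- names(dat)
  if (anyDuplicated(nms) > 0 || any(grepl("[^A-Za-z0-9._]", nms))) {
    warning(
      "Some column names are duplicated or contain special characters; ",
      "consider janitor::clean_names().",
      call. = FALSE
    )
  }
  invisible(dat)
}

check_colnames(list(`Temp [degC]` = 1:2, `Temp [degC]` = 3:4))
# warns, but leaves the data untouched
```

Because the check only warns, existing code that relies on the old names keeps working unchanged.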
Yeah, I agree that doing this unconditionally would probably be too drastic. Ideally we would make sure that future data packages include an elaborate description in the metadata of what each column means and what units it has, while the column names themselves contain only letters, periods/underscores, and numbers (not as the first character) and are guaranteed to be unique. Best would be if they adhered to some kind of ontology where, for example, age is always called age and always has the same units (e.g. Ma or ka), and d18O and d13C are always called exactly that, with the metadata indicating whether they have been adjusted for species-specific vital effects etc.
Perhaps there could be a link between the full column name and the tidied up column name so that if you ever want to get back to what was written originally you can still do so?
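One possible shape for that link, sketched in base R (the helper name and the cleaning rule are invented; janitor's actual cleaning logic differs):

```r
# Hypothetical helper: clean the names but stash the originals in an
# attribute keyed by the new names, so the mapping stays recoverable.
clean_keep_original <- function(dat) {
  orig <- names(dat)
  names(dat) <- make.names(tolower(orig), unique = TRUE)
  attr(dat, "original_names") <- stats::setNames(orig, names(dat))
  dat
}

d <- clean_keep_original(list(`Temp [degC]` = 1:2))
names(d)                   # cleaned, syntactic name
attr(d, "original_names")  # maps cleaned name back to "Temp [degC]"
```

Looking a cleaned name up in the attribute then returns the full original column name, units and all.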
I agree that the easiest implementation would be to just write a suggestion message.
In and of itself, having such column names is not too much of a problem, because (at least when using the tidyverse) you can wrap them in backticks (`). However, this is almost always very annoying to type and doesn't autocomplete ;-). And it does not work at all if any of the full column names are duplicated.
Maybe pangaear could also re-export janitor::clean_names() to make it easier for the user? That way, when they see the message they can fix the names right away without having to go and install and load janitor.
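If the maintainers went that route, the standard roxygen2 re-export idiom would look something like the following (a sketch, not existing pangaear code, and it assumes janitor were added to Imports in pangaear's DESCRIPTION):

```r
# In an R/ source file of the package: re-export janitor's function so
# pangaear users get it without attaching janitor themselves.
#' @importFrom janitor clean_names
#' @export
janitor::clean_names
```

The trade-off is the dependency concern raised earlier in this thread: a re-export makes janitor a hard dependency of pangaear.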