Hi @ericwu17,
I'm away from my laptop for a bit, but I'll take a look when I'm back in a couple of weeks. In the meantime feel free to submit a PR with updates to the encoding.
If I'm remembering correctly, I tried to run the Python script and got an error (I don't use Python very often). I do check for encoding in data-raw/taylor-lyrics.R, and that isn't showing any unexpected errors like the one in your example, so I'm not sure what is going on.
I've done a little bit of digging.
First, here's what I see when I try to run the python script:
$ python data-raw/fix-chars.py
Traceback (most recent call last):
  File "/Users/jakethompson/Documents/GIT/packages/taylor/data-raw/fix-chars.py", line 34, in <module>
    os.chdir(join(working_dir, raw_lyric_dir, album))
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jakethompson/Documents/GIT/packages/taylor/data-raw/data-raw/lyrics/10d_midnights-late-night-edition'
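The path in the error contains data-raw twice, which may indicate that the script joins a base directory already ending in data-raw with a relative path that also starts with data-raw. One way to avoid that class of error is to build paths from the repository root. A minimal sketch, where `album_dir` and the `data-raw/lyrics` default are assumptions for illustration and not the actual contents of fix-chars.py:

```python
from pathlib import Path


def album_dir(repo_root, album, raw_lyric_dir="data-raw/lyrics"):
    """Return the directory holding one album's raw lyric files.

    The base is the repository root, so "data-raw" appears only once
    regardless of the directory the script is launched from.
    (Function name and default layout are hypothetical.)
    """
    return Path(repo_root) / raw_lyric_dir / album


# Hypothetical usage: a script stored in <repo>/data-raw/ could locate the
# repository root from its own location instead of the working directory:
#   repo_root = Path(__file__).resolve().parent.parent
print(album_dir("/repo", "10d_midnights-late-night-edition").as_posix())
```

Anchoring on the script's own location rather than the current working directory means the result is the same whether the script is run from the repository root or from inside data-raw.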
When I search for non-ASCII characters, I see only 17 instances, which are all intentional (e.g., the "é" in café, rosé):
library(tidyverse)
library(taylor)

taylor_all_songs %>%
  select(album_name, track_name, lyrics) %>%
  unnest(lyrics) %>%
  select(album_name, track_name, line, lyric) %>%
  mutate(across(where(is.character), stringi::stri_enc_isascii,
    .names = "{.col}_ascii"
  )) %>%
  filter(!if_all(ends_with("ascii"))) %>%
  mutate(
    ascii_flag = map(lyric,
      .f = function(.x) {
        str_split(.x, "") %>%
          flatten_chr() %>%
          enframe() %>%
          mutate(ascii = map_lgl(
            value,
            stringi::stri_enc_isascii
          )) %>%
          filter(!ascii) %>%
          select(value, ascii)
      }
    )
  ) %>%
  unnest(ascii_flag) %>%
  select(album_name, track_name, lyric, ascii, value)
#> # A tibble: 17 × 5
#>    album_name                  track_name                      lyric ascii value
#>    <chr>                       <chr>                           <chr> <lgl> <chr>
#>  1 Fearless (Taylor's Version) White Horse (Taylor's Version)  "May… FALSE ï
#>  2 Red                         Begin Again                     "But… FALSE é
#>  3 Red                         Begin Again                     "But… FALSE é
#>  4 Red                         Begin Again                     "But… FALSE é
#>  5 Red                         Begin Again                     "But… FALSE é
#>  6 Red (Taylor's Version)      Begin Again (Taylor's Version)  "But… FALSE é
#>  7 Red (Taylor's Version)      Begin Again (Taylor's Version)  "But… FALSE é
#>  8 Red (Taylor's Version)      Begin Again (Taylor's Version)  "But… FALSE é
#>  9 Red (Taylor's Version)      Begin Again (Taylor's Version)  "But… FALSE é
#> 10 Red (Taylor's Version)      Nothing New (Taylor's Version)… "Peo… FALSE é
#> 11 Lover                       You Need To Calm Down           "But… FALSE ó
#> 12 folklore                    the 1                           "I h… FALSE é
#> 13 folklore                    the 1                           "Ros… FALSE é
#> 14 folklore                    the last great american dynasty "And… FALSE é
#> 15 folklore                    the last great american dynasty "And… FALSE í
#> 16 evermore                    champagne problems              "Dom… FALSE é
#> 17 Midnights                   Maroon                          "\"Y… FALSE é
And I'm not able to reproduce the problem from your example:
taylor_all_songs %>%
  select(album_name, track_name, lyrics) %>%
  unnest(lyrics) %>%
  filter(str_detect(lyric, "a brave man"))
#> # A tibble: 1 × 6
#>   album_name                    track_name     line lyric element element_artist
#>   <chr>                         <chr>         <int> <chr> <chr>   <chr>
#> 1 THE TORTURED POETS DEPARTMENT The Black Dog    18 You … Verse 2 Taylor Swift
Created on 2024-04-29 with reprex v2.1.0
Hi @wjakethompson, thanks for the detailed response!
I've opened a pull request (#45) which fixes the issue with the Python script. You should now be able to run the script and see the lyrics in the lyrics-raw directory get updated.
If you'd like, I can also open another pull request with the changes generated by running the Python script.
It's curious that you can't reproduce the issue when interacting with the lyrics through R. I am not familiar with how R works, but is it the case that the data-raw/taylor-lyrics.R script loads the data and then does some pre-processing? If so, that would explain why there's an issue with the files in the raw-lyrics folder while the lyrics are correct when you use R to search through them.
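One way to check this hypothesis is to inspect the raw files directly, without going through R. A minimal Python sketch, assuming the raw lyrics are stored as UTF-8 `.txt` files; the function names and the file extension are assumptions for illustration, and this is not the code in fix-chars.py:

```python
from pathlib import Path


def non_ascii_chars(text):
    """Return (line_number, character) pairs for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:
                hits.append((line_no, ch))
    return hits


def scan_directory(root):
    """Map each .txt file under root to the non-ASCII characters it contains."""
    report = {}
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" turns undecodable bytes into U+FFFD instead of
        # raising, so files that are not valid UTF-8 still get reported.
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = non_ascii_chars(text)
        if hits:
            report[str(path)] = hits
    return report


# Small self-contained check on an in-memory string.
print(non_ascii_chars("Sitting in a café\nplain line"))
```

Intentional characters such as the "é" in café would show up in this report as well, so it is the entries beyond those, for example the U+FFFD replacement character, that would be worth a closer look.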