Comments (25)

suntong avatar suntong commented on August 15, 2024 2
pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'

See

"-i", "opt_piece_script.html", "-o", "-c", "html > head > script", "-p", "SourceJS=ATTR:src",

SourceJS
foo.js
bar.js
baz.js

from cascadia.

hoshsadiq avatar hoshsadiq commented on August 15, 2024 1

The issue with using -p is that it creates columns (with headers). My proposal is about getting attributes for individual elements. In my case, I'm downloading the latest version of a file automatically, and there's no other way to get the "latest" version. Example:

<html>
<head></head>
<body>
...
<div>
<a href="/files/plugin-name/download?version=1.25">1.25</a>
</div>
...
</body>
</html>

I want to be able to retrieve only the value of href with nothing else, as well as innerText of that anchor.
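For illustration, here is a minimal Go sketch of the extraction being asked for: the href value and the inner text of each anchor. It is not how cascadia works internally. It assumes reasonably well-formed markup and uses the standard library's encoding/xml in non-strict mode as a stand-in for a real HTML parser; the names Link and extractLinks are made up for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Link holds the two values requested above: the href attribute and the anchor text.
// (Hypothetical type, used only in this sketch.)
type Link struct {
	Href string
	Text string
}

// extractLinks walks the token stream and collects every <a> element.
// encoding/xml in non-strict mode is only a stand-in for a real HTML parser.
func extractLinks(src string) ([]Link, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var links []Link
	var cur *Link // non-nil while we are inside an <a> element
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return links, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "a" {
				cur = &Link{}
				for _, a := range t.Attr {
					if a.Name.Local == "href" {
						cur.Href = a.Value
					}
				}
			}
		case xml.CharData:
			if cur != nil {
				cur.Text += string(t)
			}
		case xml.EndElement:
			if t.Name.Local == "a" && cur != nil {
				cur.Text = strings.TrimSpace(cur.Text)
				links = append(links, *cur)
				cur = nil
			}
		}
	}
}

func main() {
	page := `<html><head></head><body><div>
<a href="/files/plugin-name/download?version=1.25">1.25</a>
</div></body></html>`
	links, err := extractLinks(page)
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		// One line per anchor, no header: href, a tab, then the text.
		fmt.Printf("%s\t%s\n", l.Href, l.Text)
	}
}
```

Real-world HTML is often not well-formed enough for this approach, which is one reason dedicated tools such as cascadia or pup are preferable in practice.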

mazznoer avatar mazznoer commented on August 15, 2024 1

@hoshsadiq

You can do it using pup.

pup -f file.html 'body div a attr{href}'

suntong avatar suntong commented on August 15, 2024 1

Hmm... please try giving a minimal reproducible example. Else I won't be able to guess what the problem is.

suntong avatar suntong commented on August 15, 2024 1

Also, you didn't say which version you're using. Are you using v1.2.7?

Sorry, trying to rush out some reply before getting back to my burning issue at hand...

suntong avatar suntong commented on August 15, 2024 1

Indeed, it looks like a bug. Will look into it when I have some time. Meanwhile,

CC: @himcc, I've replicated the problem @0xdevalias reported; it seems to be a logic problem with attr selection. Do you have some time to look into it, please?

Hope the final interface would be:
--css 'html > head > script' --piece url='attr[src]', which looks more straightforward than the latter two...

suntong avatar suntong commented on August 15, 2024 1

it only output the first item; I suspect that this is where the current bug exists

Yep, I thought so too. Thanks for investigating.

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage.

That's a brilliant idea. I've searched for CSS attribute selectors many times before, but every conclusion was that they are not supported. Yeah, I fully agree with you that we should use the CSS attribute selector syntax instead.

suntong avatar suntong commented on August 15, 2024 1

Thanks for the great input. Makes sense.
I might not be able to look into it for 2~3 weeks, but I will...

0xdevalias avatar 0xdevalias commented on August 15, 2024 1

Another source of 'prior art': xq implemented this recently; you can see their approach in this comment (and in the commits linked later in the timeline):

suntong avatar suntong commented on August 15, 2024

I don't think CSS selection can select based on attributes though.

Do you want to give another example instead?

suntong avatar suntong commented on August 15, 2024

If you just want the raw CSS selection value, the workaround is to use one -p after -c:

  -p, --piece           sub CSS selectors within -css to split that block up into pieces
                        format: PieceName=[RAW:]selector_string

e.g.,

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=a'
ATag
some url

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=RAW:a'
ATag
<a href="/home">some url</a>

It's not perfect, but cascadia was not built to be perfect but for quick hacks.
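To make the PieceName=[RAW:]selector_string format from the help text concrete, here is a small Go sketch of how such an argument could be split. It is an illustration only, not cascadia's actual parser: the Piece type and parsePiece function are hypothetical names, and only the RAW: prefix from the help text above is handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Piece is a hypothetical parsed form of one -p/--piece argument.
type Piece struct {
	Name     string // column header, e.g. "ATag"
	Raw      bool   // true when the selector carried the RAW: prefix
	Selector string // sub CSS selector applied inside each -c block
}

// parsePiece splits "PieceName=[RAW:]selector_string".
// Only the first "=" separates the name, so selectors such as a[href="x"] survive intact.
func parsePiece(arg string) (Piece, error) {
	name, sel, ok := strings.Cut(arg, "=")
	if !ok || name == "" || sel == "" {
		return Piece{}, fmt.Errorf("piece %q is not in PieceName=selector form", arg)
	}
	p := Piece{Name: name, Selector: sel}
	if strings.HasPrefix(sel, "RAW:") {
		p.Raw = true
		p.Selector = strings.TrimPrefix(sel, "RAW:")
	}
	return p, nil
}

func main() {
	for _, arg := range []string{"ATag=a", "ATag=RAW:a"} {
		p, err := parsePiece(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", p)
	}
}
```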

hoshsadiq avatar hoshsadiq commented on August 15, 2024

Perhaps an alternative would be adding a --no-header option for --piece, that doesn't print out the headers?
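As a rough idea of what such an option would amount to, here is a simplified Go sketch of table output with an optional header row. The function renderTable and its noHeader parameter are hypothetical and only model the proposal; cascadia's real output code is different.

```go
package main

import (
	"fmt"
	"strings"
)

// renderTable returns the header row followed by the data rows, one per line.
// noHeader is a hypothetical switch standing in for the proposed --no-header option.
func renderTable(header []string, rows [][]string, delim string, noHeader bool) string {
	var b strings.Builder
	if !noHeader {
		b.WriteString(strings.Join(header, delim))
		b.WriteByte('\n')
	}
	for _, row := range rows {
		b.WriteString(strings.Join(row, delim))
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	header := []string{"ATag"}
	rows := [][]string{{"some url"}}
	fmt.Print(renderTable(header, rows, "\t", false)) // with the header line
	fmt.Print(renderTable(header, rows, "\t", true))  // data rows only
}
```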

suntong avatar suntong commented on August 15, 2024

Gotcha. Let me think it over...

hoshsadiq avatar hoshsadiq commented on August 15, 2024

Happy to raise a PR if needed

suntong avatar suntong commented on August 15, 2024

Uhm... I gave it careful thought and was about to turn it down, because of my belief in the "Unix philosophy" -- write programs that

  • do one thing and do it well
  • work together
  • handle text streams

I.e., attribute selection is impossible with CSS and thus should be out of scope for cascadia; and --no-header can be simply solved by sed 1d:

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=a' | sed 1d
some url

I.e., it'd be against my principles to complicate my code base for something so simple to solve. However, I do see a need in your request, and you offered a PR. So I'm OK with the PR, iff you do it the correct way -- i.e., starting from cascadia.yaml, and using wireframe for the code gen.

If that doesn't deter you, then go ahead. :-)

Thx for your contribution.

suntong avatar suntong commented on August 15, 2024

Closed for lack of activity; please reopen if there is more input...

0xdevalias avatar 0xdevalias commented on August 15, 2024

Just stumbled onto this issue as I was attempting to extract all of the src attributes for the script tags from a page, and it sounds like that should be possible with --piece, yet it didn't work for me:

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'

I also tried several variations of this, none of which seemed to work:

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:script'

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:*'

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head' --piece url='attr[src]:script'

Yet with pup, it was not only a far simpler syntax, but also just worked the first time I tried it:

⇒ curl --silent https://example.com/somethingwithscripts | pup 'html > head > script attr{src}'

/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js
/_next/static/chunks/webpack-1eeae5c7aedde088.js
/_next/static/chunks/framework-e23f030857e925d4.js
/_next/static/chunks/main-35ce5aa6f4f7a906.js
/_next/static/chunks/pages/_app-0df67bf7d9e6e732.js
/_next/static/chunks/1f110208-cda4026aba1898fb.js
/_next/static/chunks/012ff928-bcfa62e3ac82441c.js
/_next/static/chunks/68a27ff6-a453fd719d5bf767.js
/_next/static/chunks/bd26816a-981e1ddc27b37cc6.js
/_next/static/chunks/692-a1e5a91f2cd1f1d0.js
/_next/static/chunks/434-6f11f27f549beeab.js
/_next/static/chunks/97-536ee884c863676e.js
/_next/static/chunks/734-30d5c00c7bdf11c1.js
/_next/static/chunks/pages/share/%5B%5B...shareParams%5D%5D-44619ef92ec8f3b5.js
/_next/static/a3Jc7aP-UMfeR9s4-iLvW/_buildManifest.js
/_next/static/a3Jc7aP-UMfeR9s4-iLvW/_ssgManifest.js

0xdevalias avatar 0xdevalias commented on August 15, 2024

Hmm... please try giving a minimal reproducible example. Else I won't be able to guess what the problem is.

<html>
<head>
  <script src="foo.js"></script>
  <script src="bar.js"></script>
  <script src="baz.js"></script>
</head>
</html>
⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'
url




⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:script'
url




⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='attr[src]:script'
url
foo.js

Expected outcome:

url
foo.js
bar.js
baz.js

Also, you didn't say which version you're using. Are you using v1.2.7?

Version 1.2.7 built on 2023-01-08

0xdevalias avatar 0xdevalias commented on August 15, 2024

Hope the final interface would be:
--css 'html > head > script' --piece url='attr[src]', which looks more straightforward than the latter two...

While it would sort of be a breaking change, the attr prefix feels somewhat redundant as well, particularly given that in cascadia it's used in a separate context (--piece) rather than in the main --css (whereas pup needs it to differentiate, since everything appears in the one query).

The first thing I would have expected/tried was being able to use standard CSS attribute selector syntax within --piece, e.g. [src] (or perhaps also script[src] if you wanted to support the full syntax there as well).

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage. And I could do something like:

⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='[src]'

⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='script[src]'

# etc

Skimming the codebase, the following areas look like they would be relevant/related to these changes:

0xdevalias avatar 0xdevalias commented on August 15, 2024

@suntong @himcc I haven't looked too deeply at the code/tested this assumption, but from a quick skim I noticed that what appears to be the section handling --piece hardcodes operating on cssa[0]:

cascadia/cascadia_main.go

Lines 165 to 200 in 4b56cde

} else {
	// have sub CSS selectors within -css -- block selection mode
	// fmt.Printf("%v\n", piece)
	// https://godoc.org/github.com/PuerkitoBio/goquery
	// for debug
	//doc, err := goquery.NewDocumentFromReader(strings.NewReader(testhtml))
	doc, err := goquery.NewDocumentFromReader(bi)
	abortOn("Input", err)
	// Print csv headers
	for _, key := range piece.Keys {
		fmt.Fprintf(bw, "%s%s", key, deli)
	}
	fmt.Fprintf(bw, "\n")
	// Process each item block
	doc.Find(cssa[0]).Each(func(index int, item *goquery.Selection) {
		//fmt.Printf("] #%d: %s\n", index, item.Text())
		for _, key := range piece.Keys {
			//fmt.Printf("] %s: %s\n", key, piece.Values[key])
			switch piece.OutputStyles[key] {
			case OutputStyleRAW:
				html.Render(bw, item.Find(piece.Values[key]).Get(0))
				fmt.Fprintf(bw, deli)
			case OutputStyleATTR:
				fmt.Fprintf(bw, "%s%s",
					item.Find(piece.Values[key]).AttrOr(piece.AttrName[key], ""), deli)
			case OutputStyleTEXT:
				fmt.Fprintf(bw, "%s%s",
					item.Find(piece.Values[key]).Contents().Text(), deli)
			}
		}
		fmt.Fprintf(bw, "\n")
	})
}

Whereas the code that seems to handle the case without --piece loops with for _, css := range cssa:

cascadia/cascadia_main.go

Lines 127 to 165 in 4b56cde

if len(piece.Values) == 0 {
	// no sub CSS selectors -- none-block selection mode
	if textOut {
		doc, err := goquery.NewDocumentFromReader(bi)
		abortOn("Input", err)
		for _, css := range cssa {
			// Process each item block
			doc.Find(css).Each(func(index int, item *goquery.Selection) {
				//fmt.Printf("] #%d: %s\n", index, item.Text())
				if textRaw {
					fmt.Fprintf(bw, "%s%s",
						item.Text(), deli)
				} else {
					fmt.Fprintf(bw, "%s%s",
						strings.TrimSpace(item.Text()), deli)
				}
				fmt.Fprintf(bw, "\n")
			})
		}
	} else {
		doc, err := html.Parse(bi)
		abortOn("Input", err)
		for _, css := range cssa {
			c, err := cascadia.Compile(css)
			abortOn("CSS Selector string "+css, err)
			// https://godoc.org/github.com/andybalholm/cascadia
			ns := c.MatchAll(doc)
			if !beQuiet {
				fmt.Fprintf(os.Stderr, "%d elements for '%s':\n", len(ns), css)
			}
			for _, n := range ns {
				html.Render(bw, n)
				fmt.Fprintf(bw, "\n")
			}
		}
	}
} else {

Based on that, and on the fact that in my pbpaste | cascadia --in --out --css 'html > head' --piece url='attr[src]:script' example above (Ref) it only output the first item; I suspect that this is where the current bug exists. My assumption is that it would be fixed by looping here too, with something like for _, css := range cssa, instead of using cssa[0].
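The excerpt also admits a second reading, offered here only as an assumption that has not been checked against the repository. goquery documents Selection.Find as matching descendants only, and Selection.AttrOr as reading from the first matched element. The toy model below uses a hand-rolled node type and hypothetical helper names (it is not goquery); under those two rules it reproduces both outputs from the minimal example: an empty cell when the block is a script element itself, and only foo.js when the block is head.

```go
package main

import "fmt"

// node is a tiny stand-in for an HTML element; it is not goquery's type.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// findDescendants returns every descendant of n with the given tag, excluding n itself,
// modelling the descendants-only behaviour documented for Selection.Find.
func findDescendants(n *node, tag string) []*node {
	var out []*node
	for _, c := range n.children {
		if c.tag == tag {
			out = append(out, c)
		}
		out = append(out, findDescendants(c, tag)...)
	}
	return out
}

// firstAttrOr models AttrOr: the attribute of the first match, or def if absent.
func firstAttrOr(matches []*node, name, def string) string {
	if len(matches) == 0 {
		return def
	}
	if v, ok := matches[0].attrs[name]; ok {
		return v
	}
	return def
}

// allAttrs collects the attribute from every match, which is what the expected output needs.
func allAttrs(matches []*node, name string) []string {
	var out []string
	for _, m := range matches {
		if v, ok := m.attrs[name]; ok {
			out = append(out, v)
		}
	}
	return out
}

// sampleHead builds the <head> from the minimal example: three script children.
func sampleHead() *node {
	head := &node{tag: "head"}
	for _, src := range []string{"foo.js", "bar.js", "baz.js"} {
		head.children = append(head.children,
			&node{tag: "script", attrs: map[string]string{"src": src}})
	}
	return head
}

func main() {
	head := sampleHead()
	// Block = head: three matches, but only the first value is read.
	fmt.Printf("%q\n", firstAttrOr(findDescendants(head, "script"), "src", ""))
	// Block = one script: it has no script descendants, so the cell is empty.
	fmt.Printf("%q\n", firstAttrOr(findDescendants(head.children[0], "script"), "src", ""))
	// Reading from every match yields all three values.
	fmt.Println(allAttrs(findDescendants(head, "script"), "src"))
}
```

If this reading holds, looping over cssa alone would not change the output for a single --css selector; the attribute lookup would also need to consider the block element itself or every match.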

(Though personally, I still think it would be useful to simplify and improve the --piece syntax as well if we can)

suntong avatar suntong commented on August 15, 2024

Oops,

I've searched for CSS attribute selectors many times before, but every conclusion was that they are not supported

I didn't look into the URL closely, but having checked it out again just now, I found that a[title] in CSS attribute selectors means <a> elements with a title attribute, i.e., how to select the <a> elements, while here we need a syntax to return attributes, which is (still) not supported by CSS selectors.

Will think it over...

0xdevalias avatar 0xdevalias commented on August 15, 2024

I didn't look into the URL closely, but having checked it out again just now, I found that a[title] in CSS attribute selectors means <a> elements with a title attribute, i.e., how to select the <a> elements, while here we need a syntax to return attributes, which is (still) not supported by CSS selectors.

@suntong Yup, that is how it works in normal CSS selector usage, and is how it would need to work (and, I believe, how it already works) in the --query part of cascadia; but what I was proposing above is that we could leverage the same familiar semantics of that syntax so that, when used within --piece, it outputs the actual attribute being described.

So basically:

  • --query:
    • [foo] would return all elements that have an attribute named foo
    • a[foo] would return all a elements that have an attribute named foo
    • etc
  • --piece
    • [foo] would return the attribute named foo from all elements that have it
    • a[foo] would return the attribute named foo from all a elements that have it
    • etc
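To make the proposed --piece semantics concrete, here is a small Go sketch of how a selector ending in a bare [name] could be split into an element selector and the attribute to output. This models the proposal only; the function name splitAttrSelector is hypothetical and nothing like it is claimed to exist in cascadia.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAttrSelector splits a trailing bare [name] off a selector such as
// "script[src]" or "[src]". ok is false when there is no such suffix.
func splitAttrSelector(sel string) (elem, attr string, ok bool) {
	if !strings.HasSuffix(sel, "]") {
		return sel, "", false
	}
	open := strings.LastIndex(sel, "[")
	if open < 0 {
		return sel, "", false
	}
	attr = sel[open+1 : len(sel)-1]
	if attr == "" || strings.ContainsAny(attr, "=~|^$* \"'") {
		// Value-matching forms like [src="x"] stay ordinary selectors in this sketch.
		return sel, "", false
	}
	return sel[:open], attr, true
}

func main() {
	for _, sel := range []string{"[src]", "script[src]", "a", `a[href="x"]`} {
		elem, attr, ok := splitAttrSelector(sel)
		fmt.Printf("%-14s elem=%q attr=%q ok=%v\n", sel, elem, attr, ok)
	}
}
```

With a split like this, the element part would still select elements as usual, and the attribute part would tell --piece which value to print.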

So then combined, you could use these like I described above:

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage. And I could do something like:

⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='[src]'

⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='script[src]'

# etc

Originally posted by @0xdevalias in #3 (comment)

0xdevalias avatar 0xdevalias commented on August 15, 2024

Another alternative would be to re-consider how --piece works in terms of the 'prior art' from pup, and how it has various 'Display Functions' (as they call them)

They have:

  • text{}
  • attr{attrkey}
  • json{}

Personally I don't think json{} makes a lot of sense here (you could just run the HTML through an XML -> JSON tool).

I'm pretty sure the --text mode already covers what text{} would do.

So it basically seems to just leave the attr{attrkey} version.
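For comparison, here is a rough Go sketch of how a pup-style query with a trailing display function could be separated from its selector. It is only a model of the syntax listed above (text{}, attr{attrkey}, json{}), not pup's actual parser, and splitDisplay is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDisplay separates a trailing display function such as attr{src} from the
// selector in front of it. ok is false when the last token is not one of
// text{...}, attr{...} or json{...}.
func splitDisplay(query string) (selector, fn, arg string, ok bool) {
	query = strings.TrimSpace(query)
	i := strings.LastIndexAny(query, " \t") // -1 when the query is a single token
	last := query[i+1:]
	open := strings.Index(last, "{")
	if open <= 0 || !strings.HasSuffix(last, "}") {
		return query, "", "", false
	}
	name := last[:open]
	switch name {
	case "text", "attr", "json":
		return strings.TrimSpace(query[:i+1]), name, last[open+1 : len(last)-1], true
	}
	return query, "", "", false
}

func main() {
	for _, q := range []string{"html > head > script attr{src}", "body div a", "p text{}"} {
		sel, fn, arg, ok := splitDisplay(q)
		fmt.Printf("selector=%q fn=%q arg=%q ok=%v\n", sel, fn, arg, ok)
	}
}
```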

Though personally I like the idea of keeping the CSS selector --query as 'pure selectors' (unlike pup also adding in the 'Display Functions') there.

0xdevalias avatar 0xdevalias commented on August 15, 2024

Thanks for the great input. Makes sense.
I might not be able to look into it for 2~3 weeks, but I will...

@suntong No worries, I appreciate it :)

0xdevalias avatar 0xdevalias commented on August 15, 2024

Awesome! Will have to check it out once a new release is made! Thanks :)
