Rolling TODO list thread No. 1. This thread has been archived. Continue discussion in

TODO list,about jarun/googler

Comments (47)

zmwangx commented on May 22, 2024

@jarun Your TODO list. We could link it from README if you like it. Anything I'm missing at the moment?

from googler.

jarun commented on May 22, 2024

Linking from readme is very much necessary. This covers for the time... now that we already have a deb package. I'll take care of the doc stuff.


zmwangx commented on May 22, 2024

Right, deb package... It's a WIP, so it should still be listed. Added to the list with a link to the PR.


zmwangx commented on May 22, 2024

I think we could safely cross out Windows installation, because I just tried and it doesn't even work...

For one, apparently fcntl and termios aren't available outside Unix and Unix-like systems, and we rely on those to get the terminal size. On Python 3.3+ there's os.get_terminal_size (which does work on Windows; I've written a cross-platform progress bar module based on that), but we need to support 2.7, so tough luck.
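
For illustration, a minimal sketch of a layered lookup along these lines: prefer os.get_terminal_size where it exists, fall back to the fcntl/termios ioctl on Unix, and finally to a fixed default. The function name terminal_size and the 80x24 fallback are assumptions made for this example, not googler's actual code.

```python
import os
import struct
import sys


def terminal_size(fallback=(80, 24)):
    """Return (columns, lines), trying the most portable method first."""
    # Python 3.3+: available on Windows as well as Unix.
    if hasattr(os, "get_terminal_size"):
        try:
            size = os.get_terminal_size(sys.stdout.fileno())
            if size.columns and size.lines:
                return size.columns, size.lines
        except (OSError, ValueError, AttributeError):
            pass  # stdout is not a terminal (pipe, redirected file, ...)
    # Older Pythons on Unix: ask the tty driver directly.
    try:
        import fcntl
        import termios
        packed = fcntl.ioctl(sys.stdout.fileno(), termios.TIOCGWINSZ, b"\0" * 8)
        rows, cols = struct.unpack("HHHH", packed)[:2]
        if rows and cols:
            return cols, rows
    except (ImportError, OSError, IOError, ValueError, AttributeError):
        pass  # no fcntl/termios (e.g. Windows) or not a terminal
    return fallback


print(terminal_size())
```

When output is piped or redirected, both probes fail and the fallback is returned, so callers always get a usable pair of positive integers.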

Then, readline is not available. Shouldn't be too surprising either. Goodbye line editing.

At that point I shut down my VM out of frustration, so I don't know if there are other problems...

Seeing that no one ever reported this, it's safe to assume that we have no Windows users at all.


jarun commented on May 22, 2024

I updated the readme accordingly. :)
If someone really wants it, he/she can bypass these small problems. We didn't have readline at some point and I loved googler nonetheless. It saved me a lot of time even then. ;)

Many thanks for pointing it out. We don't wanna misguide our users. If you do have a licensed win VM can you try out buku as well? It uses readline, but anything else?


zmwangx commented on May 22, 2024

Will try. I do have licenses for all Windows releases since XP... But do you realize Microsoft offers official VM images that don't require a license and are typically good for 60 days (or maybe 30) from initial boot? https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/linux/. (Looks like they have taken down XP images, but I have archived XP download URLs too, here: https://gist.github.com/zmwangx/e728c56f428bc703c6f6)


zmwangx commented on May 22, 2024

Okay, so I tried googler on Windows again without fcntl, termios and readline.

Neither cmd nor PowerShell supports ANSI escape sequences, so colors don't work; you get raw sequences with ^[ displayed as a boxed question mark.
Neither cmd nor PowerShell supports Unicode (what year is it, again?), at least not by default, so even with -C you fail most of the time, because U+2013 en-dash and U+2014 em-dash are everywhere.


zmwangx commented on May 22, 2024

Interestingly, I just noticed

googler -n 3 google
 1  Google
https://www.google.com/
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find
exactly what you're looking ...

 2  Google (@google) | Twitter
https://twitter.com/google?ref_src=twsrc^google|twcamp^serp|twgr^author

That https://twitter.com/google?ref_src=twsrc^google|twcamp^serp|twgr^author is very weird...

The link in source is https://twitter.com/google?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor. We probably should not percent-decode, ~~since the decoded string isn't a valid URL~~?


zmwangx commented on May 22, 2024

As for Buku, on a fairly vanilla Windows 10 install, there's no HOME environment variable.


jarun commented on May 22, 2024

Neither cmd nor PowerShell supports Unicode

I dunno what to say.

We probably should not percent-decode, since the decoded string isn't a valid URL?

Should be fine. But we need to check whether it is valid or not too.

As for Buku, on a fairly vanilla Windows 10 install, there's no HOME environment variable.

True, and when it's not available it should create the DB file in the same dir.


zmwangx commented on May 22, 2024

We probably should not percent-decode, since the decoded string isn't a valid URL?

Should be fine.

Sorry, you mean "percent-decode should be fine", or "not percent-decode should be fine"?

I was wrong in saying https://twitter.com/google?ref_src=twsrc^google|twcamp^serp|twgr^author is not a valid URL. It actually is, because I just checked RFC 3986 again, and ^ is not reserved. Which makes it even more problematic: what we're printing is an entirely different URL that doesn't work (try it, you'll get HTTP 400).
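
To make the difference concrete, here is a small Python 3 sketch (googler itself also supports 2.7, where unquote lives in urllib) showing that percent-decoding the link from the page source produces a different string from the one Google served. The variable names are illustrative only.

```python
from urllib.parse import unquote

# The link exactly as it appears in the page source.
raw = "https://twitter.com/google?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"

# What gets printed when the link is passed through unquote first.
decoded = unquote(raw)

print(decoded)
print(decoded == raw)
```

The decoded form contains literal ^ and | characters, so a client that sends it verbatim is requesting something other than what the result page linked to.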

True, and when it's not available it should create the DB file in the same dir.

When it's not available, you get an exception when you try to os.path.join(os.getenv('HOME'), ...), because os.getenv returns None. A reliable solution might be os.path.expanduser('~'), but I don't think it's worth it to adapt just for Windows.
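
A minimal sketch of the two approaches, assuming a hypothetical database file name; neither function is taken from Buku's source. The first one fails with a TypeError when HOME is unset, because os.path.join is handed None; the second relies on os.path.expanduser, which falls back to Windows-specific variables such as USERPROFILE.

```python
import os


def db_path_from_home_env(name="bookmarks.db"):
    """Fragile: os.getenv returns None when HOME is unset, and os.path.join rejects None."""
    return os.path.join(os.getenv("HOME"), name)


def db_path_from_expanduser(name="bookmarks.db"):
    """More portable: expanduser resolves the home directory on Unix and Windows."""
    return os.path.join(os.path.expanduser("~"), name)


print(db_path_from_expanduser())
```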


jarun commented on May 22, 2024

Which makes it even more problematic: what we're printing is an entirely different URL that doesn't work (try it, you'll get HTTP 400).

Yes, I checked it the first time. It fails. I meant that we need to fix it. Show it as it comes to us.

When it's not available, you get an exception

I will check it out. Seems like I need to download a Windows image ;). Thanks for the VM image links.


zmwangx commented on May 22, 2024

Yes, I checked it the first time. It fails. I meant that we need to fix it.

The fix is trivial enough:

diff --git a/googler b/googler
index f7e0e5c..ba6f387 100755
--- a/googler
+++ b/googler
@@ -39,7 +39,6 @@ if sys.version_info > (3,):
     from urllib.parse import (
         urljoin,
         quote_plus as url_quote_plus,
-        unquote as url_unquote,
     )
     from http.client import HTTPSConnection

@@ -50,7 +49,6 @@ else:
     import HTMLParser
     from urllib import (
         quote_plus as url_quote_plus,
-        unquote as url_unquote,
     )
     from urlparse import urljoin
     from httplib import HTTPSConnection
@@ -159,8 +157,7 @@ class GoogleParser(HTMLParser.HTMLParser):
             if self.url != "":
                 if self.url.find("://", 0, 12) >= 0:
                     index = len(self.results) + 1
-                    self.results.append(Result(index, self.title,
-                                               url_unquote(self.url),
+                    self.results.append(Result(index, self.title, self.url,
                                                self.text))
                 else:
                     skipped += 1

Basically, just don't unquote.

However, I'm not sure if it will have side effects. unquote was introduced in ff58e20, but the commit message is very brief and I'm not quite sure what problem it fixes. Double quote is not a reserved character (again per RFC 3986), and webbrowser.open('https://example.com/"') works just fine. Can you give an example of not unquoting leading to problems?

Interestingly enough, although https://twitter.com/google?ref_src=twsrc^google|twcamp^serp|twgr^author is a valid URL by itself, webbrowser.open does something smartass to encode it correctly (or wrongly, I would say, and happens to land on the expected page). No luck when I use the same URL in Chrome address bar.


jarun commented on May 22, 2024

However, I'm not sure if it will have side effects.

Yes, I tried the same just now. Works. The original bug was:

$ ./googler -n1 hello world
 1  "Hello, World!" program - Wikipedia, the free encyclopedia 
https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
A "Hello, World!" program is a computer program that outputs "Hello, World!" on a display device, often standard output. Being a very simple program in most ...

Note the %22 for ". I am all ears for your opinion here.

or wrongly, I would say, and happens to land on the expected page

Doesn't work for me when I try to open result 2.


zmwangx commented on May 22, 2024

https://en.wikipedia.org/wiki/%22Hello,_World!%22_program

That URL works for me... What's the problem?

Doesn't work for me when I try to open result 2.

Then it's OS X doing the smartass thing. webbrowser.open is an AppleScript wrapper on OS X.


jarun commented on May 22, 2024

That URL works for me... What's the problem?

Trying to be perfect if possible ;). I'll add https://github.com/jarun/Buku/blob/master/buku#L796.


zmwangx commented on May 22, 2024

Trying to be perfect if possible ;)

I would say a working implementation trumps a pretty but broken one...

I'll add https://github.com/jarun/Buku/blob/master/buku#L796.

No strong objection.

By the way, I'll be out for an hour or two. Won't be able to reply until I get back.


zmwangx commented on May 22, 2024

No strong objection.

Wait, no. On second thought " isn't equivalent to %22 per RFC 3986 (correct me if I'm wrong). It works with Wikipedia, but it doesn't necessarily work everywhere. I don't think it's the right thing to do.

(I'll try to write a proof-of-concept web app that handles " and %22 differently when I get back.)


jarun commented on May 22, 2024

No strong objection.

OK then.

By the way, I'll be out for an hour or two. Won't be able to reply until I get back.

Enjoy your day!


jarun commented on May 22, 2024

Wait, no. On second thought...

Sorry, I pushed it before seeing this. Feel free to check out a better way of handling this.


zmwangx commented on May 22, 2024

Back to RFCs.

RFC 3986 §2.2:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

RFC 3986 §2.3:

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

RFC 3986 §6.2.2.2:

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs.  In addition to the case
   normalization issue noted above, some URI producers percent-encode
   octets that do not require percent-encoding, resulting in URIs that
   are equivalent to their non-encoded counterparts.  These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.

RFC 3987 §2.1:

   IRIs are defined similarly to URIs in [RFC3986], but the class of
   unreserved characters is extended by adding the characters of the UCS
   (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
   limitations given in the syntax rules below and in section 6.1.

Therefore, U+0022 Quotation Mark isn't even an allowed character in either URI or IRI. Which is very obvious because URI/IRIs should be embeddable in HTML, and HTML attributes are wrapped in double quotes. Many (if not most) modern browsers are smart enough to automatically quote the quotation mark when it appears in an actually-invalid URI, but there's no guarantee that this will work in all browsers. Erring on the safe side, I would not do this. (Not to mention " is ~~totally random~~ just one reasonable character to replace; other people might want to have other characters decoded.)


zmwangx commented on May 22, 2024

To elaborate a bit on my last point: if you support ", then it's a perfectly reasonable request to also support %3C (<) and %3E (>), both of which are not valid URI characters, again obviously for interoperability with HTML. And maybe other characters too.
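
As a rough check of the argument, the character classes quoted from RFC 3986 §2.2 and §2.3 can be written down directly. This is only a sketch: the classify helper is made up for this example, and it ignores the percent sign, which is legal only as part of a percent-encoded octet.

```python
import string

# Transcribed from the RFC 3986 grammar quoted above.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch):
    """Say which RFC 3986 class a single character falls in, if any."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "neither"


for ch in '"<>':
    print(repr(ch), classify(ch))
```

Characters that land in neither class, such as the double quote and the angle brackets, can only appear in a URI in percent-encoded form, which is why decoding them is not a safe normalization.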


jarun commented on May 22, 2024

OK OK. Consider it gone. 💃


jarun commented on May 22, 2024

BTW, if you have a collection of soothing traditional Cantonese music (lyrics-less is what I'm looking for), do share.


zmwangx commented on May 22, 2024

BTW, if you have a collection of soothing traditional Cantonese music (lyrics-less is what I'm looking for), do share.

Missed that... Unfortunately I don't 😉 My early training in music leans on the (Western) classical side, and I mostly listen to Chinese/South Korean pop music these days; in either case, no lyrics-less traditional Cantonese music.


zmwangx commented on May 22, 2024

I would like to add support for sitelinks.

I'll implement this sooner or later.


jarun commented on May 22, 2024

Awesome! Please add to the list.


zmwangx commented on May 22, 2024

See chatroom for some questions.


zmwangx commented on May 22, 2024

I think googler has accumulated enough complexity (1270 lines, close to 1000 if you take out comments) to the point that changes to one part of the program may break another part subtly, and since our test script only tests the core functionality, and worse, only watches for obvious failures, we risk introducing regressions. e159a44 is an example, although that's an embarrassingly simple one easily caught by static analysis.

Therefore, I'm thinking about unit tests. But in order to write unit tests, we first need to make googler importable, which means wrapping up bare code into functional units and having a main that is only run when __name__ == '__main__'. We also need to reduce the reliance on globals, which is easy in some cases (e.g., GoogleParser, the linchpin of googler, only uses news, which could easily be an init parameter) and slightly harder in other cases, but still very doable. (Also, reducing reliance doesn't necessarily mean absolutely no globals: debug can certainly be a global, and so can colorize and the like, which don't make much of a difference for testing purposes.)

Once we have the code contained, we can stop relying on Google leniently allowing us a few hundred queries. We can easily build up a couple thousand or more responses to a wide range of queries over a day or two, then do whatever we want with those queries. (And we can update the response repertoire once in a while; the test script can also do a few realtime queries to make sure there's no breaking change on Google's part.) The interactive parts are certainly somewhat harder to test, but I'm sure there are ways to stub things out and test them given a little bit more thought.

This will be a pretty significant undertaking, and I don't think either of us will have time to do this soon, but just want to put this idea out for scrutiny.
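
The shape of that refactoring might look roughly like the sketch below. ResultParser, parse_results, and CANNED_PAGE are hypothetical stand-ins invented for this example (the real GoogleParser does far more); the point is only that the parser takes news as an init parameter, the module is importable, and a saved response can be parsed without touching the network.

```python
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Hypothetical stand-in for GoogleParser; `news` arrives via __init__, not a global."""

    def __init__(self, news=False):
        super().__init__()
        self.news = news
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep absolute URLs only, exactly as they appear in the page.
        if href and "://" in href:
            self.urls.append(href)


def parse_results(html, news=False):
    """Pure function over a response body, so tests can feed it saved pages."""
    parser = ResultParser(news=news)
    parser.feed(html)
    parser.close()
    return parser.urls


# A tiny canned response standing in for an archived result page.
CANNED_PAGE = (
    '<a href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program">hit</a>'
    '<a href="/preferences">nav</a>'
)


def main():
    for url in parse_results(CANNED_PAGE):
        print(url)


if __name__ == "__main__":
    main()
```

A unit test then reduces to asserting on parse_results for each archived response, with a handful of live queries kept separately to detect changes on Google's side.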


jarun commented on May 22, 2024

It will be a nice improvement but we can't do this ourselves. We should add this in ToDo. Please link to your comment above.


zmwangx commented on May 22, 2024

I don't feel too strongly about this, but here's an idea: since the colors chosen by us don't always look nice in all color schemes (honestly, they don't even look so good with my slightly localized Solarized Dark: [screenshot]), we should offer a way to customize them. An option, --colors, and an env var to make up for the lack of a config file (one executable file + no config file is great and we're not gonna break that, but reading from env should be okay). As for the actual format, I think we can take a page from either GNU LS_COLORS (ls/dircolors) or BSD LSCOLORS. I would prefer BSD because it's shorter and more straightforward, but dircolors supports 256 colors, and it may be the more familiar format to some, or even most, people.


zmwangx commented on May 22, 2024

By the way, isn't that a cute screenshot? 😉


jarun commented on May 22, 2024

Your hold on generating beautiful images/videos is unparalleled. BTW, are you into photography?


zmwangx commented on May 22, 2024

BTW, are you into photography?

Not at all...


jarun commented on May 22, 2024

Try it


zmwangx commented on May 22, 2024

As a stay-at-home type of person and selfie hater, the main channels are closed. I do occasionally take a shot when I see something beautiful though.

More on topic, what do you say about colors? Please reply at your leisure.


jarun commented on May 22, 2024

Do you mean colour presets? That would be a valuable addition. But if we want users to fiddle around with it (custom colours), we will be concentrating more on colours than other features.

A set of defined presets would be great.


jarun commented on May 22, 2024

BTW, we need a new asciinema with the new prompt. Please add the prompt help as well. Your latest change makes it way more organized.


jarun commented on May 22, 2024

I am planning a new release next weekend. Please let me know if you are fine with that. Can we pull in preset colours by that time?


zmwangx commented on May 22, 2024

Do you mean colour presets?

No, because implementing color presets is actually more work for us. With BSD-style LSCOLORS, the user only needs to supply a five-letter string (which is inherently a five-element list), representing:

Index color;
Title color;
URL color;
Metadata/abstract color;
Prompt color;

and that's all. Our default is also a five-letter string. Then we use a tiny color map (BSD has 16 colors + default, we should add reverse video too, so 18), and bam, done.

In order to have presets, you basically need to do all of the above, AND you need to be a good designer, AND even then you can't make everyone happy. (I know no one has complained thus far, just like I didn't, but maybe that's because it's too small an issue; it's always nice to have the customizability there.) I'm not a designer, although I am somewhat into visual design, so there it is.
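
A minimal sketch of the five-letter scheme, assuming the BSD LSCOLORS letter convention (a to h for the eight basic colors, A to H for their bold variants, x for the default foreground). The letter X for reverse video, the field names, and the parse_colors helper are assumptions made for this example, not a settled design.

```python
# Build the 18-entry color map: 8 normal + 8 bold + default + reverse video.
COLOR_MAP = {}
for offset, letter in enumerate("abcdefgh"):
    COLOR_MAP[letter] = "\x1b[%dm" % (30 + offset)            # normal foreground
    COLOR_MAP[letter.upper()] = "\x1b[1;%dm" % (30 + offset)  # bold foreground
COLOR_MAP["x"] = "\x1b[39m"  # default foreground
COLOR_MAP["X"] = "\x1b[7m"   # reverse video (hypothetical letter choice)

RESET = "\x1b[0m"
FIELDS = ("index", "title", "url", "metadata", "prompt")


def parse_colors(spec):
    """Map a five-letter string onto the five fields, in order."""
    if len(spec) != len(FIELDS):
        raise ValueError("expected %d letters, got %r" % (len(FIELDS), spec))
    try:
        return {field: COLOR_MAP[letter] for field, letter in zip(FIELDS, spec)}
    except KeyError as exc:
        raise ValueError("unknown color letter: %s" % exc)


colors = parse_colors("Gcexx")
print(colors["title"] + "Example title" + RESET)
```

The same parser would serve both a --colors option and an environment variable, since each supplies nothing more than the five-letter string.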


zmwangx commented on May 22, 2024

we need a new asciinema with the new prompt.

I'll do that prior to the release.

I am planning a new release next weekend.

No problem.


jarun commented on May 22, 2024

because color presets is actually more work for us

I get it now. The five-letter string makes sense. I'm good.


zmwangx commented on May 22, 2024

Added to the top, will do it when I have time, probably during the weekend or even before that when I don't feel like getting other work done...


jarun commented on May 22, 2024

No hurry :)


zmwangx commented on May 22, 2024

By the way, what do you think about rolling the todo list thread? With

> document.getElementsByClassName('timeline-comment-wrapper').length
45

comments and

> document.body.offsetHeight
13054

vertical pixels, this one is getting kind of a pain to scroll. Can we start a new thread (copy over top post while getting rid of archived items) once in a while?


jarun commented on May 22, 2024

sure!


zmwangx commented on May 22, 2024

Roll. New thread at #83.

