shinnonoir / twitterwebsearch

(BROKEN, help wanted)

Home Page: https://github.com/ShinNoNoir/twitterwebsearch/issues/12

License: Other

Python 100.00%

twitterwebsearch's Issues

Automatic, high-level tweet crawling with automatic chunking

Currently, the twitterwebsearch.searcher module offers only a low-level search function, which returns the raw HTML of the search results page. If a query retrieves too many tweets, the browser process may crash because the HTML page becomes very large (easily 20x the size of the JSON representation of the tweets).

It would be an enhancement if the module offered a search function that returns tweets rather than HTML and that understands queries internally, so that it could split a query into subqueries by dividing the since: and until: range.
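A minimal sketch of the range-splitting idea, written for Python 3 and not taken from the project: the function name split_query and the days_per_chunk parameter are hypothetical. It assumes both operators use the YYYY-MM-DD form shown in the queries on this page, and that since: is inclusive while until: is exclusive, so adjacent chunks can share a boundary date without overlapping.

```python
import re
from datetime import date, timedelta

DATE_OPERATOR = r"\b(?:since|until):\d{4}-\d{2}-\d{2}"


def split_query(query, days_per_chunk=7):
    """Yield subqueries whose since:/until: windows cover the original range."""
    since_match = re.search(r"\bsince:(\d{4}-\d{2}-\d{2})", query)
    until_match = re.search(r"\buntil:(\d{4}-\d{2}-\d{2})", query)
    if not (since_match and until_match):
        # Without both bounds there is no closed range to divide.
        yield query
        return

    start = date.fromisoformat(since_match.group(1))
    end = date.fromisoformat(until_match.group(1))

    # Remove the original date operators and normalise the whitespace.
    base = " ".join(re.sub(DATE_OPERATOR, "", query).split())

    step = timedelta(days=days_per_chunk)
    while start < end:
        chunk_end = min(start + step, end)
        yield f"{base} since:{start.isoformat()} until:{chunk_end.isoformat()}"
        start = chunk_end
```

Each subquery could then be passed to the existing low-level search function one at a time, which keeps every results page small.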

twitterwebsearch finds far fewer tweets than exist

Using the query #HSVFCB lang:de since:2015-08-01 until:2016-01-25 (all tweets for a certain football game), twitterwebsearch finds 180 tweets. However, Twitter's own web interface returns more than 1000 tweets.

Any idea why?

Emojis missing

Sample query: hallo from:myemoji since:2016-03-23

HTML:

<div class="js-tweet-text-container">
  <p class="TweetTextSize  js-tweet-text tweet-text" lang="en" data-aria-label-part="0">
  <img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f497.png" draggable="false" alt="💗" title="Growing Heart" aria-label="Emoji: Growing Heart"><strong>hallo</strong> <a href="https://t.co/hQDqlwkuQC" rel="nofollow" dir="ltr" data-expanded-url="http://iemoji.com/tw/FXaKj2" class="twitter-timeline-link" target="_blank" title="http://iemoji.com/tw/FXaKj2" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">iemoji.com/tw/FXaKj2</span><span class="invisible"></span><span class="tco-ellipsis"><span class="invisible">&nbsp;</span></span></a></p>
</div>

JSON result:

[
  {
    "lang": "en",
    "permalink": "/myemoji/status/713009622550876160",
    "user_id": "246161779",
    "name": "myEmoji",
    "timestamp": 1458829715.0,
    "tweet_text": "hallo http://iemoji.com/tw/FXaKj2\u00a0",
    "tweet_id": "713009622550876160",
    "urls": [
      "http://iemoji.com/tw/FXaKj2"
    ],
    "mentions": [],
    "retweet_count": 0,
    "favorite_count": 0,
    "screen_name": "myemoji"
  }
]
]
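In the HTML above, the emoji is not a text node but an img element whose alt attribute holds the character, which would explain why a text-only extraction drops it. The following is a rough Python 3 sketch of one way to keep it, using only the standard library. The names TweetTextExtractor and tweet_text_with_emoji are hypothetical, and this is not how the project's parser is written; it also keeps link text as displayed rather than expanding URLs.

```python
from html.parser import HTMLParser


class TweetTextExtractor(HTMLParser):
    """Collect text nodes, substituting emoji images with their alt text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Emoji images carry the "Emoji" class in the markup shown above.
        if "Emoji" in classes and attributes.get("alt"):
            self.parts.append(attributes["alt"])

    def handle_data(self, data):
        self.parts.append(data)


def tweet_text_with_emoji(html):
    """Return the visible text of a tweet fragment, emoji included."""
    parser = TweetTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```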

"ValueError: No JSON object could be decoded"

Hello,

I ran twitterwebsearch recently. Tweets are returned, but they appear to be invalid JSON and cannot be decoded by json.loads().

Would you consider looking into fixing this issue?

Below is a summary of the error:

searcher.py in search(query)

 65 def search(query):
 66     for tweet in download_tweets(search=query):             <-------------
 67         yield tweet

searcher.py in download_tweets(search, profile, sleep)

 42         response = requests.get(url_more.format(term=urllib.quote_plus(term), max_position=min_position), headers={'User-agent': USER_AGENT}).text
 43         try:
 44             response_dict = json.loads(response)            <-------------
 45         except:
 46             import datetime

lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)

337             parse_int is None and parse_float is None and
338             parse_constant is None and object_pairs_hook is None and not kw):
339         return _default_decoder.decode(s)            <-------------
340     if cls is None:
341         cls = JSONDecoder

python2.7/json/decoder.pyc in decode(self, s, _w)

363         """
364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())            <-------------
365         end = _w(s, end).end()
366         if end != len(s):

python2.7/json/decoder.pyc in raw_decode(self, s, idx)

380             obj, end = self.scan_once(s, idx)
381         except StopIteration:
382             raise ValueError("No JSON object could be decoded")            <-------------
383         return obj, end

ValueError: No JSON object could be decoded
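The traceback shows that the response body is not JSON at all, but not what it actually contains. As a hedged sketch (Python 3, not the project's code), the decoding step could surface the start of the offending response so the cause is easier to diagnose. The names decode_response, InvalidResponseError and preview_chars are hypothetical.

```python
import json


class InvalidResponseError(ValueError):
    """Raised when the server reply is not the expected JSON document."""


def decode_response(text, preview_chars=200):
    """Decode a JSON response, reporting the start of the body on failure."""
    try:
        return json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        preview = text[:preview_chars]
        raise InvalidResponseError(
            "response is not valid JSON; it starts with: %r" % preview
        ) from exc
```

If the preview turns out to be an HTML page, for example an error or login page, that would point to a problem with the request rather than with the JSON decoder.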

Sometimes an incomplete query is entered in the search box, resulting in incorrect search results

Currently, the code relies on send_keys. Sometimes not all keypresses are sent, resulting in an incorrect query.

TODO: Investigate a potentially better approach to entering the query, perhaps by executing JavaScript.
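One possible shape for the JavaScript approach, as an untested sketch: it relies on Selenium's WebDriver.execute_script(script, *args), sets the field value in a single step, and reads it back to confirm nothing was lost. The function name enter_query and the CSS selector are assumptions and would need to match the real search box. Setting the value this way may not fire the input events a page listens for, so it may need adjusting.

```python
def enter_query(driver, query, selector="input[name='q']"):
    """Set the search box value via JavaScript and confirm it was stored.

    `driver` is expected to be a Selenium WebDriver; `selector` is a
    hypothetical CSS selector for the search input.
    """
    driver.execute_script(
        "document.querySelector(arguments[0]).value = arguments[1];",
        selector,
        query,
    )
    actual = driver.execute_script(
        "return document.querySelector(arguments[0]).value;",
        selector,
    )
    if actual != query:
        raise RuntimeError("search box holds %r, expected %r" % (actual, query))
    return actual
```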

censored tweet causes an exception

If the search results include a censored tweet (e.g. 648188210074513412), an exception is thrown:

Traceback (most recent call last):
  File "quickstart.py", line 23, in <module>
    main()
  File "quickstart.py", line 13, in main
    tweets = list(tweets) # convert generator into list
  File "twitterwebsearch/twitterwebsearch/searcher.py", line 64, in search
    for tweet in download_tweets(search=query):
  File "twitterwebsearch/twitterwebsearch/searcher.py", line 53, in download_tweets
    for tweet in parse_search_results(response_dict['items_html'].encode('utf8')):
  File "twitterwebsearch/twitterwebsearch/parser.py", line 68, in parse_search_results
    tweet = parse_tweet_tag(tag)
  File "twitterwebsearch/twitterwebsearch/parser.py", line 24, in parse_tweet_tag
    lang = tweet_body_tag['lang']
TypeError: 'NoneType' object has no attribute '__getitem__'

This happens because parse_tweet_tag in parser.py looks for an element with the tweet-text class, which censored tweets apparently don't have.

Regular tweet:

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="de">foo</p>

Censored tweet:

<p class="js-tweet-text">This Tweet from @FlashScoreAT has been withheld in response to a report from the copyright holder. <a href="https://support.twitter.com/articles/15795" target="_blank">Learn more</a>
</p>
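A possible guard, sketched under the assumption that the tag objects behave like BeautifulSoup tags (find with a class_ filter, get, get_text). The function name parse_tweet_body and the withheld field are hypothetical and not part of the project. The idea is to fall back to the js-tweet-text class, which both snippets above share, and to read lang with get() so that a missing attribute yields None instead of raising.

```python
def parse_tweet_body(tweet_tag):
    """Return lang and text for a tweet, tolerating withheld tweets."""
    body = tweet_tag.find("p", class_="tweet-text")
    withheld = body is None
    if withheld:
        # Censored tweets lack tweet-text but still carry js-tweet-text.
        body = tweet_tag.find("p", class_="js-tweet-text")
    if body is None:
        return None  # no recognisable text element at all
    return {
        "lang": body.get("lang"),  # None when the attribute is absent
        "tweet_text": body.get_text().strip(),
        "withheld": withheld,
    }
```

A caller could then skip entries that come back as None, or filter on the withheld flag, instead of aborting the whole crawl.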
