shinnonoir / twitterwebsearch
(BROKEN, help wanted)
Home Page: https://github.com/ShinNoNoir/twitterwebsearch/issues/12
License: Other
Currently, the twitterwebsearch.searcher module offers a low-level search function. This function returns the raw HTML representing the search results page. If a query retrieves too many tweets, the browser process may crash due to a very large HTML page (it's easily 20x the size of the JSON representation of the tweets).
It'd be an enhancement if the module offered a search function that returns tweets rather than HTML and internally understands queries, so that it could split a query into subqueries by splitting the since: and until: range.
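A minimal sketch of the range splitting described above, not the module's actual code. The function name split_query and the days parameter are hypothetical, it is written for Python 3, and it assumes since: is inclusive and until: is exclusive, so adjacent windows can share a boundary date without overlapping:

```python
import re
from datetime import date, timedelta

# Operators of the form since:YYYY-MM-DD and until:YYYY-MM-DD
SINCE_RE = re.compile(r'\bsince:(\d{4})-(\d{2})-(\d{2})\b')
UNTIL_RE = re.compile(r'\buntil:(\d{4})-(\d{2})-(\d{2})\b')


def split_query(query, days=7):
    """Split a query with a since:/until: range into subqueries of at most `days` days."""
    since_match = SINCE_RE.search(query)
    until_match = UNTIL_RE.search(query)
    if not (since_match and until_match):
        # Without both bounds there is no range to split.
        return [query]

    since = date(*map(int, since_match.groups()))
    until = date(*map(int, until_match.groups()))

    # Remove the original range operators and normalise whitespace.
    base = UNTIL_RE.sub('', SINCE_RE.sub('', query))
    base = ' '.join(base.split())

    subqueries = []
    start = since
    while start < until:
        end = min(start + timedelta(days=days), until)
        subquery = '{} since:{} until:{}'.format(base, start.isoformat(), end.isoformat())
        subqueries.append(subquery.strip())
        start = end
    # An empty or inverted range yields no windows; fall back to the original query.
    return subqueries or [query]
```

Each subquery could then be run separately and the parsed tweets concatenated, which would keep every individual results page small.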
Using the query #HSVFCB lang:de since:2015-08-01 until:2016-01-25 (all tweets for a certain football game), twitterwebsearch finds 180 tweets. However, Twitter's own web interface returns >1000 tweets.
Any idea why?
Sample query: hallo from:myemoji since:2016-03-23
HTML:
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">
<img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f497.png" draggable="false" alt="💗" title="Growing Heart" aria-label="Emoji: Growing Heart"><strong>hallo</strong> <a href="https://t.co/hQDqlwkuQC" rel="nofollow" dir="ltr" data-expanded-url="http://iemoji.com/tw/FXaKj2" class="twitter-timeline-link" target="_blank" title="http://iemoji.com/tw/FXaKj2" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">iemoji.com/tw/FXaKj2</span><span class="invisible"></span><span class="tco-ellipsis"><span class="invisible"> </span></span></a></p>
</div>
JSON result:
[
{
"lang": "en",
"permalink": "/myemoji/status/713009622550876160",
"user_id": "246161779",
"name": "myEmoji",
"timestamp": 1458829715.0,
"tweet_text": "hallo http://iemoji.com/tw/FXaKj2\u00a0",
"tweet_id": "713009622550876160",
"urls": [
"http://iemoji.com/tw/FXaKj2"
],
"mentions": [],
"retweet_count": 0,
"favorite_count": 0,
"screen_name": "myemoji"
}
]
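In the sample above, the HTML carries the emoji in the alt attribute of an img element with the Emoji class, while tweet_text in the JSON result does not contain it. One possible way to keep emoji, sketched with Python 3's standard html.parser rather than whatever parser.py actually uses (the names TweetTextExtractor and tweet_text_with_emoji are hypothetical):

```python
from html.parser import HTMLParser


class TweetTextExtractor(HTMLParser):
    """Collects text nodes and substitutes emoji images with their alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        # Emoji are rendered as <img class="Emoji ..." alt="...">
        if tag == 'img' and 'Emoji' in classes:
            self.parts.append(attrs.get('alt') or '')

    def handle_data(self, data):
        self.parts.append(data)


def tweet_text_with_emoji(html):
    """Return the text of an HTML fragment, including emoji taken from img alt attributes."""
    parser = TweetTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)
```

This only illustrates the alt-text substitution; it does not try to reproduce how the existing parser treats links or invisible spans.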
Hello,
I have run twitterwebsearch recently. Tweets are returned, but seem to be invalid JSON, and can't be decoded by json.loads().
Would you consider looking into fixing this issue?
Below is a summary of the error:
searcher.py in search(query)
65 def search(query):
66 for tweet in download_tweets(search=query): <-------------
67 yield tweet
searcher.py in download_tweets(search, profile, sleep)
42 response = requests.get(url_more.format(term=urllib.quote_plus(term), max_position=min_position), headers={'User-agent': USER_AGENT}).text
43 try:
44 response_dict = json.loads(response) <-------------
45 except:
46 import datetime
lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
337 parse_int is None and parse_float is None and
338 parse_constant is None and object_pairs_hook is None and not kw):
339 return _default_decoder.decode(s) <-------------
340 if cls is None:
341 cls = JSONDecoder
python2.7/json/decoder.pyc in decode(self, s, _w)
363 """
364 obj, end = self.raw_decode(s, idx=_w(s, 0).end()) <-------------
365 end = _w(s, end).end()
366 if end != len(s):
python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 obj, end = self.scan_once(s, idx)
381 except StopIteration:
382 raise ValueError("No JSON object could be decoded") <-------------
383 return obj, end
ValueError: No JSON object could be decoded
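The traceback shows json.loads() failing on the response body, so whatever came back was not a JSON document. A small sketch of a wrapper that reports the start of the body, which may make it easier to see what was actually returned. The names decode_json_response and UnexpectedResponseError are hypothetical, and the code targets Python 3:

```python
import json


class UnexpectedResponseError(ValueError):
    """Raised when a response body that should be JSON cannot be decoded."""


def decode_json_response(text, snippet_length=200):
    """Decode a JSON response body, or raise an error that shows how the body starts."""
    try:
        return json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        snippet = text[:snippet_length]
        raise UnexpectedResponseError(
            'Response is not valid JSON ({}); it starts with: {!r}'.format(exc, snippet)
        )
```

In download_tweets, the call to json.loads(response) could be swapped for such a helper so that the error message includes the offending body.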
Currently, the code relies on send_keys. Sometimes not all keypresses are sent, resulting in an incorrect query.
TODO: Investigate a potentially better approach to entering the query. Perhaps by executing JavaScript?
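A rough sketch of the JavaScript idea, assuming a Selenium WebDriver (whose execute_script(script, *args) passes extra arguments to the script as arguments[0], arguments[1], ...). The function name enter_query is hypothetical, and how the search box element is located is left to the caller:

```python
# JavaScript snippets executed in the browser; arguments[0] is the input element.
SET_VALUE_SCRIPT = "arguments[0].value = arguments[1];"
GET_VALUE_SCRIPT = "return arguments[0].value;"


def enter_query(driver, input_element, query):
    """Set the query on an input element in one step and verify it was stored intact."""
    driver.execute_script(SET_VALUE_SCRIPT, input_element, query)
    actual = driver.execute_script(GET_VALUE_SCRIPT, input_element)
    if actual != query:
        raise RuntimeError(
            'Query was not entered correctly: expected {!r}, got {!r}'.format(query, actual)
        )
    return actual
```

Setting the value directly does not fire keyboard or input events, so if the page depends on those, this approach alone may not be enough. The read-back check at least detects a truncated query.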
If the search results happen to include a censored tweet (e.g. 648188210074513412), an exception gets thrown:
Traceback (most recent call last):
File "quickstart.py", line 23, in <module>
main()
File "quickstart.py", line 13, in main
tweets = list(tweets) # convert generator into list
File "twitterwebsearch/twitterwebsearch/searcher.py", line 64, in search
for tweet in download_tweets(search=query):
File "twitterwebsearch/twitterwebsearch/searcher.py", line 53, in download_tweets
for tweet in parse_search_results(response_dict['items_html'].encode('utf8')):
File "twitterwebsearch/twitterwebsearch/parser.py", line 68, in parse_search_results
tweet = parse_tweet_tag(tag)
File "twitterwebsearch/twitterwebsearch/parser.py", line 24, in parse_tweet_tag
lang = tweet_body_tag['lang']
TypeError: 'NoneType' object has no attribute '__getitem__'
because parse_tweet_tag in parser.py is looking for an element with the tweet-text class, which censored tweets apparently don't have.
Regular tweet:
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="de">foo</p>
Censored tweet:
<p class="js-tweet-text">This Tweet from @FlashScoreAT has been withheld in response to a report from the copyright holder. <a href="https://support.twitter.com/articles/15795" target="_blank">Learn more</a>
</p>
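A possible guard, sketched under the assumption that the parser works on BeautifulSoup 4 Tag objects (find() with class_, get(), get_text()). The function name parse_tweet_body and the withheld key are hypothetical and not part of parser.py:

```python
def parse_tweet_body(tweet_tag):
    """Extract language and text from a tweet tag, tolerating withheld tweets."""
    # Regular tweets have a <p> carrying the tweet-text class.
    body = tweet_tag.find('p', class_='tweet-text')
    if body is not None:
        return {
            'lang': body.get('lang'),
            'tweet_text': body.get_text(),
            'withheld': False,
        }

    # Censored tweets only have <p class="js-tweet-text"> and no lang attribute.
    fallback = tweet_tag.find('p', class_='js-tweet-text')
    text = fallback.get_text() if fallback is not None else None
    return {'lang': None, 'tweet_text': text, 'withheld': True}
```

Whether withheld tweets should be returned with a flag like this or skipped entirely is a design choice for the maintainers; either way, the None check avoids the TypeError above.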