
twitterwebsearch's People

Contributors

egbertbouman, shinnonoir

twitterwebsearch's Issues

Automatic, high-level tweet crawling with automatic chunking

Currently, the twitterwebsearch.searcher module offers a low-level search function, which returns the raw HTML of the search results page. If a query retrieves too many tweets, the browser process may crash because the HTML page becomes very large (easily 20x the size of the JSON representation of the tweets).

It would be an enhancement if the module offered a search function that returns tweets rather than HTML and parses queries internally, so that it can split a query into subqueries by dividing the since: and until: range.
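The splitting step could be sketched roughly as follows. This is not part of twitterwebsearch: split_query, DATE_OP, and days_per_chunk are hypothetical names, and the sketch assumes that until: is exclusive, so that adjacent chunks sharing a boundary date do not return the same tweet twice.

```python
import re
from datetime import date, timedelta

# Matches the since:/until: operators in a search query (hypothetical helper).
DATE_OP = re.compile(r"\b(since|until):(\d{4}-\d{2}-\d{2})\b")


def split_query(query, days_per_chunk=7):
    """Yield subqueries that together cover the query's since:/until: range."""
    ops = dict(DATE_OP.findall(query))
    if "since" not in ops or "until" not in ops:
        # Without both bounds there is no range to split.
        yield query
        return
    # Strip the date operators and normalise whitespace in what is left.
    base = " ".join(DATE_OP.sub("", query).split())
    start = date(*map(int, ops["since"].split("-")))
    end = date(*map(int, ops["until"].split("-")))
    step = timedelta(days=days_per_chunk)
    while start < end:
        chunk_end = min(start + step, end)
        yield "{} since:{} until:{}".format(
            base, start.isoformat(), chunk_end.isoformat())
        start = chunk_end
```

Each subquery could then be passed to the existing search function one at a time, so that no single results page grows too large. A fixed chunk size is the simplest choice; an adaptive size would need feedback on how many tweets each chunk returned.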

twitterwebsearch finds far fewer tweets than exist

Using the query #HSVFCB lang:de since:2015-08-01 until:2016-01-25 (all tweets for a certain football game), twitterwebsearch finds 180 tweets. However, Twitter's own web interface returns >1000 tweets.

Any idea why?

Emojis missing

Sample query: hallo from:myemoji since:2016-03-23

HTML:

<div class="js-tweet-text-container">
  <p class="TweetTextSize  js-tweet-text tweet-text" lang="en" data-aria-label-part="0">
  <img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f497.png" draggable="false" alt="💗" title="Growing Heart" aria-label="Emoji: Growing Heart"><strong>hallo</strong> <a href="https://t.co/hQDqlwkuQC" rel="nofollow" dir="ltr" data-expanded-url="http://iemoji.com/tw/FXaKj2" class="twitter-timeline-link" target="_blank" title="http://iemoji.com/tw/FXaKj2" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">iemoji.com/tw/FXaKj2</span><span class="invisible"></span><span class="tco-ellipsis"><span class="invisible">&nbsp;</span></span></a></p>
</div>

JSON result:

[
  {
    "lang": "en",
    "permalink": "/myemoji/status/713009622550876160",
    "user_id": "246161779",
    "name": "myEmoji",
    "timestamp": 1458829715.0,
    "tweet_text": "hallo http://iemoji.com/tw/FXaKj2\u00a0",
    "tweet_id": "713009622550876160",
    "urls": [
      "http://iemoji.com/tw/FXaKj2"
    ],
    "mentions": [],
    "retweet_count": 0,
    "favorite_count": 0,
    "screen_name": "myemoji"
  }
]
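In the HTML above the emoji is carried only by the alt attribute of an img element with the Emoji class, so an extractor that collects text nodes alone will drop it, which matches the tweet_text in the JSON result. The following is a minimal sketch of one way to keep it, using only the standard library. TweetTextExtractor and tweet_text_with_emoji are hypothetical names, and this is not how parser.py is known to work.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TweetTextExtractor(HTMLParser):
    """Collect text nodes, substituting emoji images with their alt text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Emoji are rendered as <img class="Emoji ..." alt="<the character>">.
        if tag == "img" and "Emoji" in classes:
            self.parts.append(attrs.get("alt") or "")

    def handle_data(self, data):
        self.parts.append(data)


def tweet_text_with_emoji(html):
    """Return the text of an HTML fragment with emoji alt text kept in place."""
    parser = TweetTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

The sketch keeps every text node, including those inside invisible spans, so a real fix would still need whatever URL handling the existing parser applies.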

"ValueError: No JSON object could be decoded"

Hello,

I ran twitterwebsearch recently. Tweets are returned, but the response appears to be invalid JSON and cannot be decoded by json.loads().

Would you consider looking into fixing this issue?

Below is a summary of the error:

searcher.py in search(query)

 65 def search(query):
 66     for tweet in download_tweets(search=query):             <-------------
 67         yield tweet

searcher.py in download_tweets(search, profile, sleep)

 42         response = requests.get(url_more.format(term=urllib.quote_plus(term), max_position=min_position), headers={'User-agent': USER_AGENT}).text
 43         try:
 44             response_dict = json.loads(response)            <-------------
 45         except:
 46             import datetime

lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)

337             parse_int is None and parse_float is None and
338             parse_constant is None and object_pairs_hook is None and not kw):
339         return _default_decoder.decode(s)            <-------------
340     if cls is None:
341         cls = JSONDecoder

python2.7/json/decoder.pyc in decode(self, s, _w)

363         """
364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())            <-------------
365         end = _w(s, end).end()
366         if end != len(s):

python2.7/json/decoder.pyc in raw_decode(self, s, idx)

380             obj, end = self.scan_once(s, idx)
381         except StopIteration:
382             raise ValueError("No JSON object could be decoded")            <-------------
383         return obj, end

ValueError: No JSON object could be decoded
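The traceback shows only that json.loads() could not decode the response body; it does not show what the server actually returned. A small wrapper along these lines could make that visible. decode_search_response is a hypothetical helper, not part of searcher.py, and the idea that the body might be an HTML error page or an empty string is an assumption.

```python
import json


def decode_search_response(body, status_code=None):
    """Decode a response body as JSON, reporting what was received on failure."""
    try:
        return json.loads(body)
    except ValueError:
        # json.JSONDecodeError in Python 3 is a subclass of ValueError, so
        # this clause covers both Python 2 and Python 3.
        snippet = body[:200]
        raise ValueError(
            "Expected JSON from the search endpoint but got "
            "(status {}): {!r}".format(status_code, snippet))
```

In download_tweets this would mean keeping the requests response object, rather than only its .text, so that the status code can be passed along with the body.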

Censored tweet causes an exception

If the search results happen to include a censored tweet (e.g. 648188210074513412), an exception is thrown:

Traceback (most recent call last):
  File "quickstart.py", line 23, in <module>
    main()
  File "quickstart.py", line 13, in main
    tweets = list(tweets) # convert generator into list
  File "twitterwebsearch/twitterwebsearch/searcher.py", line 64, in search
    for tweet in download_tweets(search=query):
  File "twitterwebsearch/twitterwebsearch/searcher.py", line 53, in download_tweets
    for tweet in parse_search_results(response_dict['items_html'].encode('utf8')):
  File "twitterwebsearch/twitterwebsearch/parser.py", line 68, in parse_search_results
    tweet = parse_tweet_tag(tag)
  File "twitterwebsearch/twitterwebsearch/parser.py", line 24, in parse_tweet_tag
    lang = tweet_body_tag['lang']
TypeError: 'NoneType' object has no attribute '__getitem__'

This happens because parse_tweet_tag in parser.py looks for an element with the tweet-text class, which censored tweets apparently don't have.

Regular tweet:

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="de">foo</p>

Censored tweet:

<p class="js-tweet-text">This Tweet from @FlashScoreAT has been withheld in response to a report from the copyright holder. <a href="https://support.twitter.com/articles/15795" target="_blank">Learn more</a>
</p>
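Based on the two samples above, a tolerant lookup could match on the js-tweet-text class, which both share, and treat lang as optional. The following standard-library sketch illustrates the idea. TweetBodyFinder and inspect_tweet_body are hypothetical names, and treating a missing tweet-text class as the marker of a withheld tweet is an assumption drawn from these two samples only.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TweetBodyFinder(HTMLParser):
    """Record the attributes of the first <p> carrying the js-tweet-text class."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.body = None  # stays None if no matching <p> is found

    def handle_starttag(self, tag, attrs):
        if tag != "p" or self.body is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "js-tweet-text" in classes:
            self.body = {
                # lang is absent on the censored sample, so use .get().
                "lang": attrs.get("lang"),
                # Assumption: withheld tweets lack the tweet-text class.
                "withheld": "tweet-text" not in classes,
            }


def inspect_tweet_body(html):
    """Return {'lang': ..., 'withheld': ...} or None if no tweet body is found."""
    finder = TweetBodyFinder()
    finder.feed(html)
    finder.close()
    return finder.body
```

A caller can then skip or flag tweets where the result is None or withheld is true, instead of indexing into a missing tag.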
