
search-tweets-python's Introduction

Looking for the Twitter API v2 version? Check out the new 'v2' branch.

Twitter v2 search endpoints now include a 'counts' endpoint that returns time-series totals of matching Tweets.

Python Twitter Search API

This project serves as a wrapper for the Twitter premium and enterprise search APIs, providing a command-line utility and a Python library. Pretty docs can be seen here.

Features

  • Supports 30-day Search and Full Archive Search (not the standard Search API at this time).
  • Command-line utility is pipeable to other tools (e.g., jq; see the example after this list).
  • Automatically handles pagination of search results, with specifiable limits.
  • Delivers a stream of data to the user, keeping in-memory requirements low.
  • Handles enterprise and premium authentication methods.
  • Flexible usage within a Python program.
  • Compatible with our group's Tweet Parser for rapid extraction of relevant data fields from each Tweet payload.
  • Supports the Search Counts endpoint, which can reduce API call usage and provide rapid insights if you only need Tweet volumes and not Tweet payloads.
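
For example, because --print-stream writes Tweet JSON to stdout (assuming one JSON document per line, as in the examples further below), you can pipe the output into jq. The .text field is the standard Tweet payload field; adjust as needed:

search_tweets.py \
  --filter-rule "beyonce has:hashtags" \
  --max-results 100 \
  --print-stream \
  | jq -r '.text'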

Installation

The searchtweets library is on PyPI:

pip install searchtweets

Or you can install the development version locally via

git clone https://github.com/twitterdev/search-tweets-python
cd search-tweets-python
pip install -e .

Credential Handling

The premium and enterprise Search APIs use different authentication methods and we attempt to provide a seamless way to handle authentication for all customers. We know credentials can be tricky or annoying - please read this in its entirety.

Premium clients will require the bearer_token and endpoint fields; Enterprise clients require username, password, and endpoint. If you do not specify the account_type, we attempt to discern the account type and issue a warning about this behavior.

For premium search products, we are using app-only authentication, and the bearer tokens are not delivered with an expiration time. You can provide either:

  • your application key and secret (the library will handle bearer-token authentication)
  • a bearer token that you get yourself

Many developers might find it more straightforward to provide their application key and secret and let this library manage bearer-token generation. Please see here for an overview of the premium authentication method.

We support both YAML-file based methods and environment variables for storing credentials, and provide flexible handling with sensible defaults.

YAML method

For premium customers, the simplest credential file should look like this:

search_tweets_api:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  consumer_key: <CONSUMER_KEY>
  consumer_secret: <CONSUMER_SECRET>

For enterprise customers, the simplest credential file should look like this:

search_tweets_api:
  account_type: enterprise
  endpoint: <FULL_URL_OF_ENDPOINT>
  username: <USERNAME>
  password: <PW>

By default, this library expects this file at "~/.twitter_keys.yaml", but you can pass the relevant location as needed, either with the --credential-file flag for the command-line app or as demonstrated below in a Python program.

Both above examples require no special command-line arguments or in-program arguments. The credential parsing methods, unless otherwise specified, will look for a YAML key called search_tweets_api.

For developers who have multiple endpoints and/or search products, you can keep all credentials in the same file and specify which key to use. The --credential-file-key flag specifies this behavior in the command-line app. An example:

search_tweets_30_day_dev:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  consumer_key: <KEY>
  consumer_secret: <SECRET>
  (optional) bearer_token: <TOKEN>


search_tweets_30_day_prod:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  bearer_token: <TOKEN>

search_tweets_fullarchive_dev:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  bearer_token: <TOKEN>

search_tweets_fullarchive_prod:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  bearer_token: <TOKEN>

Environment Variables

If you want or need to pass credentials via environment variables, you can set the appropriate variables for your product from the following:

export SEARCHTWEETS_ENDPOINT=
export SEARCHTWEETS_USERNAME=
export SEARCHTWEETS_PASSWORD=
export SEARCHTWEETS_BEARER_TOKEN=
export SEARCHTWEETS_ACCOUNT_TYPE=
export SEARCHTWEETS_CONSUMER_KEY=
export SEARCHTWEETS_CONSUMER_SECRET=

The load_credentials function will attempt to find these variables if it cannot load fields from the YAML file, and it will overwrite any credentials from the YAML file that are also present as environment variables. This behavior can be changed by setting the load_credentials parameter env_overwrite to False.

The following cells demonstrate credential handling in the Python library.

from searchtweets import load_credentials
load_credentials(filename="./search_tweets_creds_example.yaml",
                 yaml_key="search_tweets_ent_example",
                 env_overwrite=False)
{'username': '<MY_USERNAME>',
 'password': '<MY_PASSWORD>',
 'endpoint': '<MY_ENDPOINT>'}
load_credentials(filename="./search_tweets_creds_example.yaml",
                 yaml_key="search_tweets_premium_example",
                 env_overwrite=False)
{'bearer_token': '<A_VERY_LONG_MAGIC_STRING>',
 'endpoint': 'https://api.twitter.com/1.1/tweets/search/30day/dev.json',
 'extra_headers_dict': None}

Environment Variable Overrides

If we set our environment variables, the program will look for them regardless of a YAML file's validity or existence.

import os
os.environ["SEARCHTWEETS_USERNAME"] = "<ENV_USERNAME>"
os.environ["SEARCHTWEETS_PASSWORD"] = "<ENV_PW>"
os.environ["SEARCHTWEETS_ENDPOINT"] = "<https://endpoint>"

load_credentials(filename="nothing_here.yaml", yaml_key="no_key_here")
cannot read file nothing_here.yaml
Error parsing YAML file; searching for valid environment variables
{'username': '<ENV_USERNAME>',
 'password': '<ENV_PW>',
 'endpoint': '<https://endpoint>'}

Command-line app

The flags:

  • --credential-file <FILENAME>
  • --credential-file-key <KEY>
  • --env-overwrite

are used to control credential behavior from the command-line app.
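
For example, a hypothetical invocation that points the app at a non-default credential file and the search_tweets_30_day_dev key from the multi-key example above might look like this (the file name is illustrative):

search_tweets.py \
  --credential-file ./my_keys.yaml \
  --credential-file-key search_tweets_30_day_dev \
  --filter-rule "beyonce has:hashtags" \
  --print-stream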


Using the Command Line Application

The library includes an application, search_tweets.py, that provides rapid access to Tweets. When you use pip to install this package, search_tweets.py is installed globally. The file is located in the tools/ directory for those who want to run it locally.

Note that the --results-per-call flag specifies an argument to the API (maxResults, the number of results returned per CALL), not a hard maximum on the number of results returned from this program. The argument --max-results defines the maximum number of results to return for a given session. All examples assume that your credentials are set up correctly in the default location - ~/.twitter_keys.yaml - or in environment variables.

Stream json results to stdout without saving

search_tweets.py \
  --max-results 1000 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --print-stream

Stream json results to stdout and save to a file

search_tweets.py \
  --max-results 1000 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --filename-prefix beyonce_geo \
  --print-stream

Save to file without output

search_tweets.py \
  --max-results 100 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --filename-prefix beyonce_geo \
  --no-print-stream

One or more custom headers can be specified from the command line, using the --extra-headers argument and a JSON-formatted string representing a dictionary of extra headers:

search_tweets.py \
  --filter-rule "beyonce has:hashtags" \
  --extra-headers '{"<MY_HEADER_KEY>":"<MY_HEADER_VALUE>"}'

Options can be passed via a configuration file (either ini or YAML). Example files can be found in tools/api_config_example.config and tools/api_yaml_example.yaml, and might look like this:

[search_rules]
from_date = 2017-06-01
to_date = 2017-09-01
pt_rule = beyonce has:geo

[search_params]
results_per_call = 500
max_results = 500

[output_params]
save_file = True
filename_prefix = beyonce
results_per_file = 10000000

Or this:

search_rules:
    from-date: 2017-06-01
    to-date: 2017-09-01 01:01
    pt-rule: kanye

search_params:
    results-per-call: 500
    max-results: 500

output_params:
    save_file: True
    filename_prefix: kanye
    results_per_file: 10000000

Custom headers can be specified in a config file, under a specific credentials key:

search_tweets_api:
  account_type: premium
  endpoint: <FULL_URL_OF_ENDPOINT>
  username: <USERNAME>
  password: <PW>
  extra_headers:
    <MY_HEADER_KEY>: <MY_HEADER_VALUE>

When using a config file in conjunction with the command-line utility, you need to specify your config file via the --config-file parameter. Additional command-line arguments will be added to the config-file arguments, or will overwrite them if both are specified.

Example:

search_tweets.py \
  --config-file myapiconfig.config \
  --no-print-stream

Full options are listed below:

$ search_tweets.py -h
usage: search_tweets.py [-h] [--credential-file CREDENTIAL_FILE]
                      [--credential-file-key CREDENTIAL_YAML_KEY]
                      [--env-overwrite ENV_OVERWRITE]
                      [--config-file CONFIG_FILENAME]
                      [--account-type {premium,enterprise}]
                      [--count-bucket COUNT_BUCKET]
                      [--start-datetime FROM_DATE] [--end-datetime TO_DATE]
                      [--filter-rule PT_RULE]
                      [--results-per-call RESULTS_PER_CALL]
                      [--max-results MAX_RESULTS] [--max-pages MAX_PAGES]
                      [--results-per-file RESULTS_PER_FILE]
                      [--filename-prefix FILENAME_PREFIX]
                      [--no-print-stream] [--print-stream]
                      [--extra-headers EXTRA_HEADERS] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --credential-file CREDENTIAL_FILE
                        Location of the yaml file used to hold your
                        credentials.
  --credential-file-key CREDENTIAL_YAML_KEY
                        the key in the credential file used for this session's
                        credentials. Defaults to search_tweets_api
  --env-overwrite ENV_OVERWRITE
                        Overwrite YAML-parsed credentials with any set
                        environment variables. See API docs or readme for
                        details.
  --config-file CONFIG_FILENAME
                        configuration file with all parameters. Far easier to
                        use than the command-line args version. If a valid
                        file is found, all args will be populated from there.
                        Remaining command-line args will overrule args found
                        in the config file.
  --account-type {premium,enterprise}
                        The account type you are using
  --count-bucket COUNT_BUCKET
                        Set this to make a 'counts' request. Bucket size for
                        the counts endpoint. Options: day, hour, minute.
  --start-datetime FROM_DATE
                        Start of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: -30 days)
  --end-datetime TO_DATE
                        End of datetime window, format 'YYYY-mm-DDTHH:MM'
                        (default: most recent date)
  --filter-rule PT_RULE
                        PowerTrack filter rule (See: http://support.gnip.com/c
                        ustomer/portal/articles/901152-powertrack-operators)
  --results-per-call RESULTS_PER_CALL
                        Number of results to return per call (default 100; max
                        500) - corresponds to 'maxResults' in the API. If making a 'counts' request with '--count-bucket', this parameter is ignored.
  --max-results MAX_RESULTS
                        Maximum number of Tweets or Counts to return for this
                        session (defaults to 500)
  --max-pages MAX_PAGES
                        Maximum number of pages/API calls to use for this
                        session.
  --results-per-file RESULTS_PER_FILE
                        Maximum tweets to save per file.
  --filename-prefix FILENAME_PREFIX
                        prefix for the filename where tweet json data will be
                        stored.
  --no-print-stream     disable print streaming
  --print-stream        Print tweet stream to stdout 
  --extra-headers EXTRA_HEADERS
                        JSON-formatted str representing a dict of additional
                        request headers
  --debug               print all info and warning messages

Using the Twitter Search APIs' Python Wrapper

Working with the API within a Python program is straightforward both for Premium and Enterprise clients.

We'll assume that credentials are in the default location, ~/.twitter_keys.yaml.

from searchtweets import ResultStream, gen_rule_payload, load_credentials

Enterprise setup

enterprise_search_args = load_credentials("~/.twitter_keys.yaml",
                                          yaml_key="search_tweets_enterprise",
                                          env_overwrite=False)

Premium Setup

premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                       yaml_key="search_tweets_premium",
                                       env_overwrite=False)

There is a function, gen_rule_payload, that formats search API rules into valid JSON queries. It has sensible defaults, such as pulling more Tweets per call than the default 100 (note that a sandbox environment can only have a max of 100 here, so if you get errors, please check this) and not including dates. Discussing the finer points of generating search rules is out of scope for these examples; I encourage you to see the docs to learn the nuances, but for now let's see what a rule looks like.

rule = gen_rule_payload("beyonce", results_per_call=100) # testing with a sandbox account
print(rule)
{"query":"beyonce","maxResults":100}

This rule will match tweets that have the text beyonce in them.

From this point, there are two ways to interact with the API: a quick method that collects smaller amounts of Tweets into memory and requires less thought and knowledge, and the ResultStream object, which will be introduced later.

Fast Way

We'll use the search_args variable to power the configuration point for the API. The object also takes a valid PowerTrack rule and has options to cut off the search when hitting limits on both the number of Tweets and API calls.

We'll be using the collect_results function, which has three parameters.

  • rule: a valid PowerTrack rule, referenced earlier
  • max_results: as the library handles pagination, it will stop collecting when we reach this number
  • result_stream_args: configuration args that we've already specified.

For the remaining examples, please change the args to either premium or enterprise depending on your usage.

Let's see how it goes:

from searchtweets import collect_results
tweets = collect_results(rule,
                         max_results=100,
                         result_stream_args=enterprise_search_args) # change this if you need to

By default, Tweet payloads are lazily parsed into a Tweet object. An overwhelming number of Tweet attributes are made available directly, as such:

[print(tweet.all_text, end='\n\n') for tweet in tweets[0:10]];
Jay-Z &amp; Beyoncé sat across from us at dinner tonight and, at one point, I made eye contact with Beyoncé. My limbs turned to jello and I can no longer form a coherent sentence. I have seen the eyes of the lord.

Beyoncé and it isn't close. https://t.co/UdOU9oUtuW

As you could guess.. Signs by Beyoncé will always be my shit.

When Beyoncé adopts a dog 🙌🏾 https://t.co/U571HyLG4F

Hold up, you can't just do that to Beyoncé
https://t.co/3p14DocGqA

Why y'all keep using Rihanna and Beyoncé gifs to promote the show when y'all let Bey lose the same award she deserved 3 times and let Rihanna leave with nothing but the clothes on her back? https://t.co/w38QpH0wma

30) anybody tell you that you look like Beyoncé https://t.co/Vo4Z7bfSCi

Mi Beyoncé favorita https://t.co/f9Jp600l2B
Beyoncé necesita ver esto. Que diosa @TiniStoessel 🔥🔥🔥 https://t.co/gadVJbehQZ

Joanne Pearce Is now playing IF I WAS A BOY - BEYONCE.mp3 by !

I'm trynna see beyoncé's finsta before I die
[print(tweet.created_at_datetime) for tweet in tweets[0:10]];
2018-01-17 00:08:50
2018-01-17 00:08:49
2018-01-17 00:08:44
2018-01-17 00:08:42
2018-01-17 00:08:42
2018-01-17 00:08:42
2018-01-17 00:08:40
2018-01-17 00:08:38
2018-01-17 00:08:37
2018-01-17 00:08:37
[print(tweet.generator.get("name")) for tweet in tweets[0:10]];
Twitter for iPhone
Twitter for iPhone
Twitter for iPhone
Twitter for iPhone
Twitter for iPhone
Twitter for iPhone
Twitter for Android
Twitter for iPhone
Airtime Pro
Twitter for iPhone

Voila, we have some Tweets. For interactive environments and other cases where you don't mind collecting your data in a single load and don't need to operate on the stream of Tweets or counts directly, I recommend using this convenience function.

Working with the ResultStream

The ResultStream object will be powered by the search_args, and takes the rules and other configuration parameters, including a hard stop on number of pages to limit your API call usage.

rs = ResultStream(rule_payload=rule,
                  max_results=500,
                  max_pages=1,
                  **premium_search_args)

print(rs)
ResultStream: 
 {
    "username":null,
    "endpoint":"https:\/\/api.twitter.com\/1.1\/tweets\/search\/30day\/dev.json",
    "rule_payload":{
        "query":"beyonce",
        "maxResults":100
    },
    "tweetify":true,
    "max_results":500
}

There is a function, .stream, that seamlessly handles requests and pagination for a given query. It returns a generator, and to grab our 500 Tweets that mention beyonce we can do this:

tweets = list(rs.stream())

Tweets are lazily parsed using our Tweet Parser, so tweet data is very easily extractable.

# print the text of the first ten Tweets
[print(tweet.all_text) for tweet in tweets[0:10]];
gente socorro kkkkkkkkkk BEYONCE https://t.co/kJ9zubvKuf
Jay-Z &amp; Beyoncé sat across from us at dinner tonight and, at one point, I made eye contact with Beyoncé. My limbs turned to jello and I can no longer form a coherent sentence. I have seen the eyes of the lord.
Beyoncé and it isn't close. https://t.co/UdOU9oUtuW
As you could guess.. Signs by Beyoncé will always be my shit.
When Beyoncé adopts a dog 🙌🏾 https://t.co/U571HyLG4F
Hold up, you can't just do that to Beyoncé
https://t.co/3p14DocGqA
Why y'all keep using Rihanna and Beyoncé gifs to promote the show when y'all let Bey lose the same award she deserved 3 times and let Rihanna leave with nothing but the clothes on her back? https://t.co/w38QpH0wma
30) anybody tell you that you look like Beyoncé https://t.co/Vo4Z7bfSCi
Mi Beyoncé favorita https://t.co/f9Jp600l2B
Beyoncé necesita ver esto. Que diosa @TiniStoessel 🔥🔥🔥 https://t.co/gadVJbehQZ
Joanne Pearce Is now playing IF I WAS A BOY - BEYONCE.mp3 by !

Counts Endpoint

We can also use the Search API Counts endpoint to get counts of Tweets that match our rule. Each request will return up to 30 days of results, and each count request can be done on a minutely, hourly, or daily basis. The underlying ResultStream object will handle converting your endpoint to the count endpoint, and you have to specify the count_bucket argument when making a rule to use it.

The process is very similar to grabbing Tweets, but has some minor differences.

Caveat - premium sandbox environments do NOT have access to the Search API counts endpoint.

count_rule = gen_rule_payload("beyonce", count_bucket="day")

counts = collect_results(count_rule, result_stream_args=enterprise_search_args)

Our results are pretty straightforward and can be rapidly used.

counts
[{'count': 366, 'timePeriod': '201801170000'},
 {'count': 44580, 'timePeriod': '201801160000'},
 {'count': 61932, 'timePeriod': '201801150000'},
 {'count': 59678, 'timePeriod': '201801140000'},
 {'count': 44014, 'timePeriod': '201801130000'},
 {'count': 46607, 'timePeriod': '201801120000'},
 {'count': 41523, 'timePeriod': '201801110000'},
 {'count': 47056, 'timePeriod': '201801100000'},
 {'count': 65506, 'timePeriod': '201801090000'},
 {'count': 95251, 'timePeriod': '201801080000'},
 {'count': 162883, 'timePeriod': '201801070000'},
 {'count': 106344, 'timePeriod': '201801060000'},
 {'count': 93542, 'timePeriod': '201801050000'},
 {'count': 110415, 'timePeriod': '201801040000'},
 {'count': 127523, 'timePeriod': '201801030000'},
 {'count': 131952, 'timePeriod': '201801020000'},
 {'count': 176157, 'timePeriod': '201801010000'},
 {'count': 57229, 'timePeriod': '201712310000'},
 {'count': 72277, 'timePeriod': '201712300000'},
 {'count': 72051, 'timePeriod': '201712290000'},
 {'count': 76371, 'timePeriod': '201712280000'},
 {'count': 61578, 'timePeriod': '201712270000'},
 {'count': 55118, 'timePeriod': '201712260000'},
 {'count': 59115, 'timePeriod': '201712250000'},
 {'count': 106219, 'timePeriod': '201712240000'},
 {'count': 114732, 'timePeriod': '201712230000'},
 {'count': 73327, 'timePeriod': '201712220000'},
 {'count': 89171, 'timePeriod': '201712210000'},
 {'count': 192381, 'timePeriod': '201712200000'},
 {'count': 85554, 'timePeriod': '201712190000'},
 {'count': 57829, 'timePeriod': '201712180000'}]

Note that this will only work with the full archive search option, which is available to my account only via the enterprise options. Full archive search will likely require a different endpoint or access method; please see your developer console for details.

Let's make a new rule and pass it dates this time.

gen_rule_payload takes timestamps of the following forms:

  • YYYYmmDDHHMM
  • YYYY-mm-DD (which will be converted to midnight UTC (00:00))
  • YYYY-mm-DD HH:MM
  • YYYY-mm-DDTHH:MM

Note - all Tweets are stored in UTC time.

rule = gen_rule_payload("from:jack",
                        from_date="2017-09-01", #UTC 2017-09-01 00:00
                        to_date="2017-10-30",#UTC 2017-10-30 00:00
                        results_per_call=500)
print(rule)
{"query":"from:jack","maxResults":500,"toDate":"201710300000","fromDate":"201709010000"}
tweets = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args)
[print(tweet.all_text) for tweet in tweets[0:10]];
More clarity on our private information policy and enforcement. Working to build as much direct context into the product too https://t.co/IrwBexPrBA
To provide more clarity on our private information policy, we’ve added specific examples of what is/is not a violation and insight into what we need to remove this type of content from the service. https://t.co/NGx5hh2tTQ
Launching violent groups and hateful images/symbols policy on November 22nd https://t.co/NaWuBPxyO5
We will now launch our policies on violent groups and hateful imagery and hate symbols on Nov 22. During the development process, we received valuable feedback that we’re implementing before these are published and enforced. See more on our policy development process here 👇 https://t.co/wx3EeH39BI
@WillStick @lizkelley Happy birthday Liz!
Off-boarding advertising from all accounts owned by Russia Today (RT) and Sputnik.

We’re donating all projected earnings ($1.9mm) to support external research into the use of Twitter in elections, including use of malicious automation and misinformation. https://t.co/zIxfqqXCZr
@TMFJMo @anthonynoto Thank you
@gasca @stratechery @Lefsetz letter
@gasca @stratechery Bridgewater’s Daily Observations
Yup!!!! ❤️❤️❤️❤️ #davechappelle https://t.co/ybSGNrQpYF
@ndimichino Sometimes
Setting up at @CampFlogGnaw https://t.co/nVq8QjkKsf
rule = gen_rule_payload("from:jack",
                        from_date="2017-09-20",
                        to_date="2017-10-30",
                        count_bucket="day",
                        results_per_call=500)
print(rule)
{"query":"from:jack","toDate":"201710300000","fromDate":"201709200000","bucket":"day"}
counts = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args)
[print(c) for c in counts];
{'timePeriod': '201710290000', 'count': 0}
{'timePeriod': '201710280000', 'count': 0}
{'timePeriod': '201710270000', 'count': 3}
{'timePeriod': '201710260000', 'count': 6}
{'timePeriod': '201710250000', 'count': 4}
{'timePeriod': '201710240000', 'count': 4}
{'timePeriod': '201710230000', 'count': 0}
{'timePeriod': '201710220000', 'count': 0}
{'timePeriod': '201710210000', 'count': 3}
{'timePeriod': '201710200000', 'count': 2}
{'timePeriod': '201710190000', 'count': 1}
{'timePeriod': '201710180000', 'count': 6}
{'timePeriod': '201710170000', 'count': 2}
{'timePeriod': '201710160000', 'count': 2}
{'timePeriod': '201710150000', 'count': 1}
{'timePeriod': '201710140000', 'count': 64}
{'timePeriod': '201710130000', 'count': 3}
{'timePeriod': '201710120000', 'count': 4}
{'timePeriod': '201710110000', 'count': 8}
{'timePeriod': '201710100000', 'count': 4}
{'timePeriod': '201710090000', 'count': 1}
{'timePeriod': '201710080000', 'count': 0}
{'timePeriod': '201710070000', 'count': 0}
{'timePeriod': '201710060000', 'count': 1}
{'timePeriod': '201710050000', 'count': 3}
{'timePeriod': '201710040000', 'count': 5}
{'timePeriod': '201710030000', 'count': 8}
{'timePeriod': '201710020000', 'count': 5}
{'timePeriod': '201710010000', 'count': 0}
{'timePeriod': '201709300000', 'count': 0}
{'timePeriod': '201709290000', 'count': 0}
{'timePeriod': '201709280000', 'count': 9}
{'timePeriod': '201709270000', 'count': 41}
{'timePeriod': '201709260000', 'count': 13}
{'timePeriod': '201709250000', 'count': 6}
{'timePeriod': '201709240000', 'count': 7}
{'timePeriod': '201709230000', 'count': 3}
{'timePeriod': '201709220000', 'count': 0}
{'timePeriod': '201709210000', 'count': 1}
{'timePeriod': '201709200000', 'count': 7}

Contributing

Any contributions should follow this pattern:

  1. Make a feature or bugfix branch, e.g., git checkout -b my_new_feature
  2. Make your changes in that branch
  3. Ensure you bump the version number in searchtweets/_version.py to reflect your changes. We use Semantic Versioning, so non-breaking enhancements should increment the minor version, e.g., 1.5.0 -> 1.6.0, and bugfixes will increment the patch version, e.g., 1.6.0 -> 1.6.1.
  4. Create a pull request

After the pull request is accepted, package maintainers will handle building the documentation and distribution to PyPI.

For reference, distributing to PyPI is accomplished with the following commands, run from the root directory of the repo:

python setup.py bdist_wheel
python setup.py sdist
twine upload dist/*

How to build the documentation:

Building the documentation requires a few Sphinx packages:

pip install sphinx
pip install sphinx_bootstrap_theme
pip install sphinxcontrib-napoleon

Then (once your changes are committed to master) you should be able to run the documentation-generating bash script and follow the instructions:

bash build_sphinx_docs.sh master searchtweets

Note that this README is also generated, and so after any README changes you'll need to re-build the README (you need pandoc version 2.1+ for this) and commit the result:

bash make_readme.sh

search-tweets-python's People

Contributors

42b, andypiper, aureliaspecker, fionapigott, jeffakolb, jimmoffitt, jrmontag, lfsando, oihamza-zz, v11ncent


search-tweets-python's Issues

Consolidate API setup boilerplate into helper function

I'm noticing a pattern in the examples (and my own code) where I create the requisite YAML file -
containing my endpoint, account, and user creds - and then I write a line or two of JSON/dict parsing to create a *_args object which goes into the collect_results() function. Here's an example from some code I'm working on right now:

with open(".twitter_keys.yaml") as f:
    creds = yaml.load(f)

search_endpoint = creds["search_api"]["endpoint"]
count_endpoint = change_to_count_endpoint(search_endpoint)

search_args = {"username": creds["search_api"]["username"],
               "password": creds["search_api"]["password"],
               #"bearer_token": creds["search_api"]["bearer_token"],
               "endpoint": search_endpoint,
               }
count_args = {"username": creds["search_api"]["username"],
              "password": creds["search_api"]["password"],
              # "bearer_token": creds["search_api"]["bearer_token"],
              "endpoint": count_endpoint,
             }    

rule = gen_rule_payload('cannes', from_date='2017-05-17', to_date='2017-05-29')

tweets = collect_results(rule, max_results=1000, result_stream_args=search_args)

It might be more clear to the user if we make strict expectations about the YAML contents and then used a helper function to hide the requisite YAML parsing and manipulation. I imagine replacing the above with the following:

with open(".twitter_keys.yaml") as f:
    creds = yaml.load(f)
    # this new dict has all the relevant keys for collect_results()
    api_args = get_api_args(creds)

rule = gen_rule_payload('cannes', from_date='2017-05-17', to_date='2017-05-29')

# and collect_results() could use a simple endpoint switch
tweets = collect_results(rule, max_results=1000, result_stream_args=api_args, endpoint='tweets')

where the get_api_args() handles the YAML parsing, the string manipulations for the different endpoints, and conditional logic for the existence of optional yaml keys.

This streamlines the user experience (fewer lines of code), but doesn't add much abstraction. Thoughts? If it sounds useful, I'm happy to take a first stab at it.
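
A rough sketch of what such a helper could look like, purely as an illustration of the proposal above (names, defaults, and the import path for change_to_count_endpoint are assumptions):

from searchtweets.api_utils import change_to_count_endpoint  # assumed import path

def get_api_args(creds, key="search_api", endpoint="tweets"):
    """Hypothetical helper: build a result_stream_args dict from parsed YAML creds."""
    section = creds[key]
    search_endpoint = section["endpoint"]
    if endpoint == "counts":
        search_endpoint = change_to_count_endpoint(search_endpoint)
    args = {"endpoint": search_endpoint}
    # copy over whichever optional auth fields exist in the YAML section
    for field in ("username", "password", "bearer_token"):
        if field in section:
            args[field] = section[field]
    return args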

save as json

Hi, is there a way to save all queried tweets as json files first, before parsing them into Twitter Parser?
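
One way to approach this, sketched against the ResultStream usage shown in the README above (treating the tweetify flag seen in the ResultStream configuration as an assumption, and assuming premium_search_args from the earlier setup), is to keep the raw dict payloads and write them as newline-delimited JSON before any parsing:

import json
from searchtweets import ResultStream, gen_rule_payload

rule = gen_rule_payload("beyonce", results_per_call=100)
# tweetify=False (assumption) keeps each result as a plain dict
# rather than a parsed Tweet object
rs = ResultStream(rule_payload=rule, max_results=500, tweetify=False,
                  **premium_search_args)

with open("tweets.jsonl", "w") as f:
    for tweet in rs.stream():
        f.write(json.dumps(tweet) + "\n")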

truncated tweet does not contain `extended_tweet`

I am using Premium API and according to #62 (comment) I should be able to access extended_tweet, but it is not the case. Any ideas?

Here is one example of retrieved tweet via Premium Search:

{'created_at': 'Tue Nov 26 14:44:42 +0000 2019', 'id': 1199338194476490755, 'id_str': '1199338194476490755', 'text': '@Twins Unfortunate that such a great looking uniform has been defaced with the mark of 🇺🇸 Cop hating, Communist lov… https://t.co/E39DoIaQUP', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Twins', 'name': 'Minnesota Twins', 'id': 39397148, 'id_str': '39397148', 'indices': [0, 6]}], 'urls': [{'url': 'https://t.co/E39DoIaQUP', 'expanded_url': 'https://twitter.com/i/web/status/1199338194476490755', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': 1198979679136473088, 'in_reply_to_status_id_str': '1198979679136473088', 'in_reply_to_user_id': 39397148, 'in_reply_to_user_id_str': '39397148', 'in_reply_to_screen_name': 'Twins', 'user': {'id': 1358231077, 'id_str': '1358231077', 'name': 'Fishin🎣', 'screen_name': 'InDa906Eh', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 33, 'friends_count': 306, 'listed_count': 0, 'created_at': 'Wed Apr 17 00:51:18 +0000 2013', 'favourites_count': 1714, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': False, 'statuses_count': 1111, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': '131516', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme14/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme14/bg.gif', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1181605743608381442/lGiYEUrc_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1181605743608381442/lGiYEUrc_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1358231077/1573519171', 'profile_link_color': '009999', 'profile_sidebar_border_color': 'EEEEEE', 'profile_sidebar_fill_color': 'EFEFEF', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none'}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'lang': 'en'}

warning ... YAMLLoadWarning ... deprecated

first time user here. I get this RED warning message:

~/env/.../python3.5/site-packages/searchtweets/credentials.py:34:
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, 
as the default Loader is unsafe. 
Please read https://msg.pyyaml.org/load for full details.
  search_creds = yaml.load(f)[yaml_key]

when executing this code:

    premium_search_args = searchtweets.load_credentials("twitter_credentials.yaml",
                                       yaml_key="tweets_search_fullarchive",
                                       env_overwrite=False)

Full Archive Search - handles Next token - larger than 500 results.

Hi,

Thanks for this wonderful package. I'm using the enterprise full archive search feature for our company and was wondering how I can grab more than 500 tweets. For requests that match more than 500 tweets, there is a next token provided that leads to the next 500 tweets, right?

So how do I get all of the results of the full archive search, not just the first 500 tweets? Do the gen_rule_payload() and collect_results() functions give back all of the tweets, not just the first 500?

Thanks for the help!

Below is the full archive example script provided in the docs:

rule = gen_rule_payload("from:jack", 
    from_date="2017-09-01", #UTC 2017-09-01 00:0 
    to_date="2017-10-30",#UTC 2017-10-30 00:00
    results_per_call=500)

print(rule)

{"query":"from:jack","maxResults":500,"toDate":"201710300000","fromDate":"201709010000"}

tweets = collect_results(rule, max_results=500, result_stream_args=enterprise_search_args)

n_requests not consistent

I'm trying to retrieve data from the full-archive API. The following query returns 8 tweets correctly but 4 requests (rs.n_requests) are executed.

rs = ResultStream(rule_payload=gen_rule_payload("china from:realdonaldtrump", from_date='2015-06-01', to_date='2015-09-01', results_per_call=100), max_results=100, **premium_search_args)

data = list(rs.stream())

Since one request should return up to 100 tweets for sandbox accounts, why does this query consume 4 requests?

UPDATE: Sorry, this is no bug of the lib. This issue is discussed here https://twittercommunity.com/t/using-more-requests-than-expected-with-searchtweets-module-and-fullarchive-endpoint/106722/2

Unclear on use patterns: PATH script or local file in repo?

If I understand correctly, when the library is pip-installed, the current setup.py file copies the executable search_tweets.py into the bin/ dir of the relevant python environment. In my experience, this is so the user can run e.g. (env)$ search_tweets.py from any location and the stand-alone script will still be on the $PATH. However, the current search_tweets.py doesn't have the #! line, so it leads to some unexpected errors - in both my case and this SO post, the resulting output comes from ImageMagick (of all places...).

In the README, the user is instructed to run the local, repo file as (env)$ python search_tweets.py from the tools/ directory. This requires the user to have the repo cloned locally.

I think it would be helpful to be more clear about which is the recommended way for the user to run the main search_tweets.py script (not when imported as a library).

My preference would be to enable the file to run by keeping it in the setup.py script and adding the appropriate shebang. Then, we could remove the language about using the repo version, thus removing the expectation that the user has downloaded or cloned the repo. But I'm happy to hear more about the relative trade-offs of the approaches here.

Scraped tweets are the same for every request

I'm using the premium sandbox API for scraping tweets and I can only scrape 100 tweets per request, so if I need to scrape 500 tweets on one day I will need to do 5 requests. The problem is that each request scrapes the same 100 tweets. How can I make sure that the 100 tweets I scrape in one request are different from the 100 tweets in each of the other requests?
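
For reference, pagination across requests is handled by the library itself (see the Fast Way and ResultStream sections above), so rather than issuing the same rule five separate times, a sketch like the following lets the library page through five sandbox-sized calls for you (the rule text is a placeholder and premium_search_args is the setup object from the README above):

from searchtweets import gen_rule_payload, collect_results

rule = gen_rule_payload("your search terms here", results_per_call=100)
tweets = collect_results(rule,
                         max_results=500,
                         result_stream_args=premium_search_args)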

HTTP Rate Limit Error on Premium Full Archive Search?

I have 13 similar queries, 9 of them succeed, however 4 of them error due to HTTP Rate Limiting.

Full Error Output:

ERROR:searchtweets.result_stream:HTTP Error code: 429: Exceeded rate limit
ERROR:searchtweets.result_stream:Rule payload: {
      'fromDate': '201708220000', 
      'maxResults': 500, 
      'toDate': '201709030000', 
      'query': 'from:1307298884 OR from:355574901 OR from:11996222 OR from:588649189 OR from:117808071 OR from:19401084 OR from:19401550 OR from:735608306585198592 OR from:241642996 OR from:27888518 OR from:56470183 OR from:14861004 OR from:106728683 OR from:19683725 OR from:229379349 OR from:16425419 OR from:20562924 OR from:76138084 OR from:76303 OR from:399111773 OR from:26774590 OR from:795296683533942784 OR from:159669473 OR from:345161879 OR from:4784677524 OR from:22654208 OR from:2227415623 OR from:139114806 OR from:34030321 OR from:140660078 OR from:247503175 OR from:14791386 OR from:3014668581 OR from:279136688 OR from:64798315 OR from:2247970886 OR from:704407090060730368 OR from:82960432 OR from:64523032 OR from:396045731 OR from:45564482 OR from:28645139 OR from:18205191 OR from:734824555051646976 OR from:357026180 OR from:44900997 OR from:14192680 OR from:4813084988 OR from:600958397 OR from:605603344 OR from:577537350 OR from:589419748 OR from:1429761 OR from:401527825 OR from:15937025 OR from:2788847458', 
'next': 'eyJtYXhJZCI6OTAwMjkyNzgzMzk0NjM1Nzc2fQ=='}

Relevant YAML configuration details:

search_params:
    results-per-call: 500
    max-results: 1000000

output_params:
    save_file: True
    results_per_file: 10000000

search_tweets_api:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/history.json

It also appears that these failed queries also count against our paid requests, even though nothing gets written to the output file.

Command-line app refinement and testing

We need to ensure that the command line app works seamlessly for enterprise and premium users. I am also very open to redesigning the specification of arguments via either the command line or a file. Currently, the project is using configparser to pass args to the program, with overrides for redundant args passed via the command line.

Also, are there other use cases to support via the command line? Do we need to examine how the file saving or stdout usage occurs?

Update credential doc for clarity and error handling

In working with @jimmoffitt, we noticed that credential handling setup is not 100% clear.

The flexibility introduced by #14 is perhaps not explicit about the defaults values of both the key file (.twitter_keys.yaml )and the keys in the file ( search_tweets_api ).

Also, there is no graceful error when a KeyError occurs when parsing the credential file.

Using more requests than expected

Hi,

I'm not sure if this is a search-tweets module specific bug or if it is upstream - sorry if the latter. I posted on the twitter developer forum but no response yet (link to forum post at bottom).

I'm using the fullarchive endpoint with premium access and I've found I'm using more requests than intended (I'd like to use 1) when grabbing tweets for a handle over the last 90 days. Passing max_requests to ResultStream() also doesn't help. Finally, in this case only 88 tweets are matched by that search query. From my understanding of the docs I should only ever be using 1 request (as max results is 500 and each request returns at most 500 tweets).

raw_data_test = {}

print("Starting: {} API calls used".format(ResultStream.session_request_counter))

def make_rule(handle, to_date, from_date):
    _rule_a = "from:"+handle
    rule = gen_rule_payload(_rule_a,
                        from_date=from_date,
                        to_date=to_date,
                        results_per_call=500)
    return rule

days_to_scrape = 90 
for indx_,handle_list, date in to_scrape[32:33]:
    
    to_datetime = pd.to_datetime(date)
    from_dateime = to_datetime - pd.Timedelta(days_to_scrape, unit='D')
    from_datestring = str(from_dateime.date())
    to_datestring = str(to_datetime.date())

    for handle in handle_list:
        #print(handle)
        print('collecting results...')
        search_query = make_rule(handle, to_datestring,from_datestring)

        print(search_query)
        rs = ResultStream(rule_payload=search_query,
                                      max_results=500,
                                      **search_args)

        results_list = list(rs.stream())
        print("You have used {} API calls".format(ResultStream.session_request_counter))
        raw_data_test[search_query] = results_list
        #time.sleep(2)

Output:
Starting: 0 API calls used
collecting results…
{“query”: “from:xxxxxx”, “maxResults”: 500, “toDate”: “201502050000”, “fromDate”: “201411070000”}
You have used 3 API calls

Please let me know if there is anything further info that would help.

Thanks!

(https://twittercommunity.com/t/using-more-requests-than-expected-with-searchtweets-module-and-fullarchive-endpoint/106722)

Adding all premium operators in API

Is there a way we can extend the API params to support all operators provided by the Twitter Premium API (fullarchive or 30day)? I tried adding from:somename in the config.yaml file but it didn't work as expected.
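
For reference, premium operators such as from: are part of the rule itself rather than separate config keys: they go into pt_rule in a config file or --filter-rule on the command line (see the config examples above). A hypothetical config snippet, with placeholder dates and handle:

[search_rules]
from_date = 2018-01-01
to_date = 2018-02-01
pt_rule = from:somename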

Rate limit handling

From the community forums - it's worth a potential addition to handle rate limits for long-running requests or for other use cases.

Consider this more of a note for future discussion than for a specific implementation strategy, which could range from a time.sleep() call to auto-adjusting calls based on the type of environment (e.g., sandbox vs prod).
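
As a rough illustration of the time.sleep() option mentioned above (not a committed design; the exact exception type raised on a 429 is not pinned down here, so the sketch catches broadly and re-raises on the final attempt):

import time
from searchtweets import collect_results

def collect_with_backoff(rule, stream_args, tries=3, wait=60):
    # hypothetical helper: retry with a pause between attempts
    for attempt in range(tries):
        try:
            return collect_results(rule, max_results=500,
                                   result_stream_args=stream_args)
        except Exception:
            if attempt == tries - 1:
                raise
            time.sleep(wait)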

result_type parameter premium/enterprise search

Hi,

For standard search, I see a search parameter "result_type" that has the options: mixed, recent, popular. For developer/enterprise search do you guys know if there is a similar parameter to sort the tweet results?

Thanks!

Is it possible to search for more than one user in a query?

Is it possible to search for Tweets created by more than one user in a single query?

Going off the following provided example:

rule = gen_rule_payload("from:jack",
                        from_date="2017-09-01", #UTC 2017-09-01 00:00
                        to_date="2017-10-30",#UTC 2017-10-30 00:00
                        results_per_call=500)

I tried the following variations with no success:

rule = gen_rule_payload("from:jack", "from:bill",
rule = gen_rule_payload('"from:jack" OR "from:bill"', 
rule = gen_rule_payload("('from:jack', 'from:bill')",
rule = gen_rule_payload("('from:jack' OR 'from:bill',
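
For reference, PowerTrack rules accept OR inside a single rule string without extra quoting (the same form appears in the rate-limit error payload elsewhere on this page), so a variation like the following should be closer:

rule = gen_rule_payload("from:jack OR from:bill",
                        from_date="2017-09-01",
                        to_date="2017-10-30",
                        results_per_call=500)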

Incorrectly getting 401 status code despite correct authentication parameters

Followed the readme steps to install the library via pip, added the relevant code in a file (see below), ran as python3 script, and keep getting 401 status code.

Relevant code invoking the library:

from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                   yaml_key="search_tweets_api",
                                   env_overwrite=False)
rule = gen_rule_payload("beyonce", results_per_call=100) # testing with a sandbox account
print(rule)
print("TOKEN is: '{}'".format(premium_search_args['bearer_token']))
tweets = collect_results(rule,
                     max_results=100,
                     result_stream_args=premium_search_args) # change this if you need to
[print(tweet.all_text, end='\n\n') for tweet in tweets[0:10]];

Confirmed that ~/.twitter_keys.yaml has good creds and that the relevant application setup is good - was able to use the bearer token printed from the library in a curl request as follows:

curl -X POST "https://api.twitter.com/1.1/tweets/search/30day/testing.json" -d '{"query": "beyonce", "maxResults": 100}' -H "Authorization: Bearer PASTED_TOKEN_HERE"

tweet_mode

How do I get the full text with tweet_mode=extended ?

The tweets I get are truncated.

[KeyError: 'maxResults'] Error when creating count_rule

I tried connecting to the Counts API endpoint via the documentation here:

https://github.com/twitterdev/search-tweets-python#counts-endpoint

count_rule = gen_rule_payload("beyonce", count_bucket="day")

But this throws me the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-33-62bbdf3105fa> in <module>
----> 1 count_rule = gen_rule_payload('beyonce', count_bucket="day") # count_bucked may be "day", "hour" or "minute"

~\Anaconda3\lib\site-packages\searchtweets\api_utils.py in gen_rule_payload(pt_rule, results_per_call, from_date, to_date, count_bucket, tag, stringify)
    128         if set(["day", "hour", "minute"]) & set([count_bucket]):
    129             payload["bucket"] = count_bucket
--> 130             del payload["maxResults"]
    131         else:
    132             logger.error("invalid count bucket: provided {}"

KeyError: 'maxResults'

I ended up specifying results_per_call=500 in the gen_rule_payload arguments, which solved my problem. However, this seemed kind of scary when working with a paid API. I don't know if this is working as intended, but please either fix the documentation or the code :)

Specifying objects to return within json results

I can reliably return the full json with all the metadata with ResultStream and:

[print(tweet) for tweet in tweets]

However I'm having trouble just requesting specific objects of interest (e.g. just the text), using the same methods you can use with the search api:

[print(tweet.full_text) for tweet in tweets]

Or trying to specify using the json structure:

[print(tweet.user.extended_tweet.full_text) for tweet in tweets]

Is there syntax specific to the premium endpoints that's required to trim some of the objects out of the json (ideally keeping the structure of those remaining)?

Add a log message to indicate endpoint switch to the user

The ResultStream constructor selects the /counts or /search endpoint by reading the json rule payload and checking to see if there is a bucket key, (here)

While convenient, this switch can also be confusing to the user. I'd like to propose a logging message if the endpoint is swapped. Perhaps a logging.warning so that it shows in a Jupyter notebook.

Thoughts?
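
A sketch of what the proposed message might look like, as a standalone helper rather than the library's actual code (function name and logger name are illustrative):

import logging

logger = logging.getLogger("searchtweets.result_stream")

def warn_if_counts_endpoint(rule_payload, endpoint):
    # hypothetical helper: emit a visible message when the counts switch happens
    if "bucket" in rule_payload:
        logger.warning("'bucket' found in rule payload; requests will be sent "
                       "to the counts endpoint rather than %s", endpoint)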

python version compatibility

The merge_dicts function is only compatible with Python 3.5+ due to the {**dict1, **dict2} syntax. It's a trivial fix to make it compatible with older versions of python 3.

Thanks @jimmoffitt for the note.
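
A minimal version-agnostic equivalent, as a sketch (not necessarily how the library ends up implementing it):

def merge_dicts(*dicts):
    """Merge dicts left to right; later values win on key collisions."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged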

problem with filter_rules

search_tweets.py --credential-file twitter_keys.yaml --max-results 100 --results-per-call 100 --filter-rule "HayatımınSırrı -is:retweet" --start-datetime "2017-08-21" --end-datetime "2017-08-23" --filename-prefix efsane5 --no-print-stream

I wrote this query because I want Tweets from a specific date range and I want to exclude retweets. My account is a sandbox premium account. When I add the "-is:retweet" parameter it gives this error - how can I solve this problem?

HTTP Error code: 422: There were errors processing your request: Reference to invalid operator 'is:retweet'. Operator is not available in current product or product packaging

Error with load_credentials

Hello,
after installing the packages I created a YAML file as shown here:

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json
  # if you have a bearer token, you can use it below. otherwise, swap the comment marks and use 
  # your app's consumer key/secret - the library will generate and use a bearer token for you. 
#bearer_token: <A_VERY_LONG_MAGIC_STRING> 
  consumer_key: <Sznpkb******************************>
  consumer_secret: <lP8Xr*****************************************************>

but when I run the code as shown here

premium_search_args = load_credentials("C:/Users/DRC/Desktop/.twitter_keys.yaml",
                                       yaml_key="search_tweets_premium",
                                       env_overwrite=False)

I got this error

cannot read file C:/Users/DRC/Desktop/.twitter_keys.yaml
Error parsing YAML file; searching for valid environment variables
Account type is not specified and cannot be inferred.
        Please check your credential file, arguments, or environment variables
        for issues. The account type must be 'premium' or 'enterprise'.
        

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-55-953b455c2646> in <module>
      1 premium_search_args = load_credentials("C:/Users/DRC/Desktop/.twitter_keys.yaml",
      2                                        yaml_key="search_tweets_premium",
----> 3                                        env_overwrite=False)

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in load_credentials(filename, account_type, yaml_key, env_overwrite)
    187                    if env_overwrite
    188                    else merge_dicts(env_vars, yaml_vars))
--> 189     parsed_vars = _parse_credentials(merged_vars, account_type=account_type)
    190     return parsed_vars
    191 

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in _parse_credentials(search_creds, account_type)
     80         """
     81         logger.error(msg)
---> 82         raise KeyError
     83 
     84     try:

KeyError:

If I modify the path in load_credentials and remove the dot before twitter_keys, the error message changes as shown here:

premium_search_args = load_credentials("C:/Users/DRC/Desktop/twitter_keys.yaml",
                                       yaml_key="search_tweets_premium",
                                       env_overwrite=False)
---------------------------------------------------------------------------
ScannerError                              Traceback (most recent call last)
<ipython-input-3-e34d8bf3703d> in <module>
      1 premium_search_args = load_credentials("C:/Users/DRC/Desktop/twitter_keys.yaml",
      2                                        yaml_key="search_tweets_premium",
----> 3                                        env_overwrite=False)

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in load_credentials(filename, account_type, yaml_key, env_overwrite)
    179     filename = "~/.twitter_keys.yaml" if filename is None else filename
    180 
--> 181     yaml_vars = _load_yaml_credentials(filename=filename, yaml_key=yaml_key)
    182     if not yaml_vars:
    183         logger.warning("Error parsing YAML file; searching for "

~\Anaconda3\lib\site-packages\searchtweets\credentials.py in _load_yaml_credentials(filename, yaml_key)
     32     try:
     33         with open(os.path.expanduser(filename)) as f:
---> 34             search_creds = yaml.load(f)[yaml_key]
     35     except FileNotFoundError:
     36         logger.error("cannot read file {}".format(filename))

~\Anaconda3\lib\site-packages\yaml\__init__.py in load(stream, Loader)
     70     loader = Loader(stream)
     71     try:
---> 72         return loader.get_single_data()
     73     finally:
     74         loader.dispose()

~\Anaconda3\lib\site-packages\yaml\constructor.py in get_single_data(self)
     33     def get_single_data(self):
     34         # Ensure that the stream contains a single document and construct it.
---> 35         node = self.get_single_node()
     36         if node is not None:
     37             return self.construct_document(node)

~\Anaconda3\lib\site-packages\yaml\composer.py in get_single_node(self)
     34         document = None
     35         if not self.check_event(StreamEndEvent):
---> 36             document = self.compose_document()
     37 
     38         # Ensure that the stream contains no more documents.

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_document(self)
     53 
     54         # Compose the root node.
---> 55         node = self.compose_node(None, None)
     56 
     57         # Drop the DOCUMENT-END event.

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_node(self, parent, index)
     82             node = self.compose_sequence_node(anchor)
     83         elif self.check_event(MappingStartEvent):
---> 84             node = self.compose_mapping_node(anchor)
     85         self.ascend_resolver()
     86         return node

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_mapping_node(self, anchor)
    131             #    raise ComposerError("while composing a mapping", start_event.start_mark,
    132             #            "found duplicate key", key_event.start_mark)
--> 133             item_value = self.compose_node(node, item_key)
    134             #node.value[item_key] = item_value
    135             node.value.append((item_key, item_value))

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_node(self, parent, index)
     82             node = self.compose_sequence_node(anchor)
     83         elif self.check_event(MappingStartEvent):
---> 84             node = self.compose_mapping_node(anchor)
     85         self.ascend_resolver()
     86         return node

~\Anaconda3\lib\site-packages\yaml\composer.py in compose_mapping_node(self, anchor)
    125         if anchor is not None:
    126             self.anchors[anchor] = node
--> 127         while not self.check_event(MappingEndEvent):
    128             #key_event = self.peek_event()
    129             item_key = self.compose_node(node, None)

~\Anaconda3\lib\site-packages\yaml\parser.py in check_event(self, *choices)
     96         if self.current_event is None:
     97             if self.state:
---> 98                 self.current_event = self.state()
     99         if self.current_event is not None:
    100             if not choices:

~\Anaconda3\lib\site-packages\yaml\parser.py in parse_block_mapping_key(self)
    426 
    427     def parse_block_mapping_key(self):
--> 428         if self.check_token(KeyToken):
    429             token = self.get_token()
    430             if not self.check_token(KeyToken, ValueToken, BlockEndToken):

~\Anaconda3\lib\site-packages\yaml\scanner.py in check_token(self, *choices)
    113     def check_token(self, *choices):
    114         # Check if the next token is one of the given types.
--> 115         while self.need_more_tokens():
    116             self.fetch_more_tokens()
    117         if self.tokens:

~\Anaconda3\lib\site-packages\yaml\scanner.py in need_more_tokens(self)
    147         # The current token may be a potential simple key, so we
    148         # need to look further.
--> 149         self.stale_possible_simple_keys()
    150         if self.next_possible_simple_key() == self.tokens_taken:
    151             return True

~\Anaconda3\lib\site-packages\yaml\scanner.py in stale_possible_simple_keys(self)
    287                 if key.required:
    288                     raise ScannerError("while scanning a simple key", key.mark,
--> 289                             "could not find expected ':'", self.get_mark())
    290                 del self.possible_simple_keys[level]
    291 

ScannerError: while scanning a simple key
  in "C:/Users/DRC/Desktop/twitter_keys.yaml", line 5, column 3
could not find expected ':'
  in "C:/Users/DRC/Desktop/twitter_keys.yaml", line 7, column 3

Is that because of the bearer token, or is there a mistake in my YAML file?

results_per_call defaults to 100, not 500 as stated in the docs

I ran a collect_results query without setting the results_per_call parameter, using a premium (non-sandbox) endpoint that supports 500 tweets per request. The docs say results_per_call defaults to the maximum allowed, which is 500.

max_results was set to 2500, so I expected 5 (or at most 6) requests. But according to my Twitter dashboard, 26 requests were used, which roughly corresponds to 2500/100 = 25.
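Until the default is fixed or the docs corrected, passing results_per_call explicitly avoids the extra requests; a minimal sketch, assuming a non-sandbox premium endpoint (the query is a placeholder):

from searchtweets import gen_rule_payload, collect_results, load_credentials

# Reads ~/.twitter_keys.yaml and the search_tweets_api key by default.
premium_search_args = load_credentials()

# Explicitly request 500 Tweets per call instead of relying on the default.
rule = gen_rule_payload("snow", results_per_call=500)

# 2500 / 500 -> roughly 5 requests rather than ~25 at 100 per call.
tweets = collect_results(rule, max_results=2500,
                         result_stream_args=premium_search_args)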

Dates Are Wrong

Using premium full archive search via search-tweets, I found a tweet (https://twitter.com/TheFatApple/status/29118205491) dated Oct 29, 2010. However, the API says its created_at attribute is November 4, 2010, which is wrong. This may be an API-side bug rather than a library bug, but I'm raising it here to see if anyone else has experienced something similar.

Search by coordinate

Hello! Is it possible to include coordinates in the gen_rule_payload argument? Maybe this already exists, but I can't find a list of optional arguments. It also seems that you have to have something in the query field, right? (Ideally I'd like to collect any tweets posted in a specific location over a specific time period; is that possible with the premium API?) Thanks!
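If your premium tier includes the geo operators, a point_radius clause can go directly into the rule string passed to gen_rule_payload; a rough sketch (the keyword, coordinates, radius, and dates are placeholders, and operator availability depends on your subscription):

from searchtweets import gen_rule_payload, collect_results, load_credentials

premium_search_args = load_credentials()

# point_radius takes [longitude latitude radius]; pairing it with a keyword
# avoids an empty query, which some tiers reject.
rule = gen_rule_payload("pizza point_radius:[-122.41 37.77 10mi]",
                        from_date="2019-01-01",
                        to_date="2019-01-07",
                        results_per_call=100)

tweets = collect_results(rule, max_results=100, result_stream_args=premium_search_args)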

Failed to install by running `pip install searchtweets` on Mac

Hi there!

I ran into some problems when trying to install the searchtweets package. I ran pip install searchtweets but received the following error:

....
long_description=open('README.rst', 'r', encoding="utf-8").read(),
    TypeError: 'encoding' is an invalid keyword argument for this function
------------------
Command "python setup.py egg_info" failed with error code 1 in .....

I'm on a Mac and tried both Python 2 and Python 3, getting the same error on both. Does anyone have a hint about this? Or do I need to do anything else before installing?
Thank you!
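For what it's worth, the traceback points at setup.py passing encoding to the built-in open(), which Python 2's open() does not accept; a hedged sketch of a Python 2-compatible alternative (not necessarily how the maintainers would fix it):

# setup.py (sketch): io.open accepts an encoding argument on both Python 2 and 3,
# unlike the Python 2 built-in open().
import io

with io.open('README.rst', 'r', encoding='utf-8') as f:
    long_description = f.read()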

Simplify parsing version in setup.py?

If importing VERSION directly is out of the question, there seem to be a few other ways to simplify the version-parsing code.

>>> import re
... 
... 
... def parse_version(str_):
...     """
...     Parses the program's version from a python variable declaration.
...     """
...     v = re.findall(r"\d+\.\d+\.\d+", str_)
...     if v:
...         return v[0]
...     else:
...         print("cannot parse string {}".format(str_))
...         raise KeyError
... 
>>> # original
... with open("./searchtweets/_version.py") as f:
...     _version_line = [line for line in f.readlines()
...                      if line.startswith("VERSION")][0].strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # no readlines
... with open("./searchtweets/_version.py") as f:
...     _version_line = [line for line in f
...                      if line.startswith("VERSION")][0].strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # use next instead of a list comp + zero index
... with open("./searchtweets/_version.py") as f:
...     _version_line = next(line for line in f
...                          if line.startswith("VERSION")).strip()
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # remove unnecessary str.strip
... with open("./searchtweets/_version.py") as f:
...     _version_line = next(line for line in f
...                          if line.startswith("VERSION"))
...     VERSION = parse_version(_version_line)
... 
>>> VERSION
'1.7.4'

>>> # pass f.read() directly
... with open("./searchtweets/_version.py") as f:
...     VERSION = parse_version(f.read())
... 
>>> VERSION
'1.7.4'

>>> # get rid of parse_version function altogether
... with open("./searchtweets/_version.py") as f:
...     VERSION = re.search(r'\d+\.\d+\.\d+', f.read()).group()
... 
>>> VERSION
'1.7.4'

Query total tweet count

I think it would be really useful to add an option to get the total number of tweets for a given query. This can be done by pulling counts for that query and adding them up. It would let you check whether you have enough requests at your current premium/enterprise tier (500 tweets/request) and estimate the total cost of running the query. It is especially useful if you only need the premium API for specific queries (e.g., analyzing tweets from users during a specific event).

This would be similar to a feature I believe GNIP PowerTrack used to have, for checking the cost of a query before starting it.

I can work on a PR myself if this is considered low priority internally; I'm currently doing this manually and it's quite tiresome.
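Until there is a dedicated option, something close to this can be pieced together from the library's counts support; a rough sketch, assuming your subscription includes the counts endpoint (the query is a placeholder):

from searchtweets import gen_rule_payload, collect_results, load_credentials

search_args = load_credentials()

# count_bucket can be "day", "hour", or "minute"; the query is a placeholder.
count_rule = gen_rule_payload("beyonce", count_bucket="day")

# Each bucket looks roughly like {"count": 366, "timePeriod": "201801170000"}.
counts = collect_results(count_rule, result_stream_args=search_args)
total = sum(bucket["count"] for bucket in counts)
print(total)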

ResultStream produces max_requests+1 requests

from searchtweets import ResultStream, gen_rule_payload, load_credentials

search_args = load_credentials()
rule = gen_rule_payload("rule", results_per_call=100)
rs = ResultStream(rule_payload=rule,
                  max_requests=1,
                  **search_args)
tweets = list(rs.stream())

This ResultStream produces 2 requests (and len(tweets) will be 200).

Add more specific metadata to request headers

For the purposes of tracking usage adoption, it would be nice to include some library and version information in the HTTP requests.

I believe the requests library under the hood of searchtweets sends something like 'User-Agent': 'python-requests/1.2.0' with each request. Perhaps we could modify the headers roughly like this:

import searchtweets

version = searchtweets.__version__
useragent = 'search-tweets-python/{}'.format(version)
headers = {'User-Agent': useragent}

I'm not sure whether it would be best to set this in the request() method or in the make_session() method. The requests docs mention doing it in both places (former, latter).

@binaryaaron @fionapigott @jeffakolb are there other metadata things that would be useful in the headers?
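A minimal sketch of the session-level option, reusing the searchtweets.__version__ attribute from the snippet above (assuming the package exposes it):

import requests
import searchtweets

def make_tagged_session():
    # Attach the library's User-Agent to every request made through this session.
    session = requests.Session()
    useragent = 'search-tweets-python/{}'.format(searchtweets.__version__)
    session.headers.update({'User-Agent': useragent})
    return session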

Python 2 / 3 compatibility

This is somewhat self-explanatory: this repo has only been built for Python 3. Should we extend support to Python 2.7? If so, this work needs to be done.

'Tweet' object has no attribute 'reply_count'

tweets = collect_results(rule, max_results=500, result_stream_args=premium_search_args)
for tweet in tweets: print(tweet.reply_count)

returns:

Exception has occurred: AttributeError
'Tweet' object has no attribute 'reply_count'

but while debugging, I can clearly see reply_count on the tweet. What gives?
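One thing worth checking: as far as I can tell, the parsed Tweet objects from tweet_parser are dict-like, so properties exposed as attributes and keys present in the raw payload are not the same thing, and reply_count may only exist as a payload key (and only on some products/formats). A speculative sketch:

# Speculative: if the parsed Tweet behaves like a dict, the raw payload key can
# be read directly; .get() avoids an exception when the field is absent.
for tweet in tweets:
    print(tweet.get("reply_count"))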

PyYAML 5.1 warning

Currently, running searchtweets against the latest PyYAML release results in a warning on loading the configuration file:

credentials.py:34: YAMLLoadWarning: calling yaml.load() without Loader=… is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

More information in this issue.

This is a minor, low-priority fix, as the syntax is currently only deprecated and the code still works despite the warning being issued.
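The likely one-line fix in credentials.py is to switch to the safe loader; a sketch of what that might look like (not necessarily the patch the maintainers will ship):

import os
import yaml

def _load_yaml_credentials(filename, yaml_key):
    # yaml.safe_load silences the YAMLLoadWarning and avoids the unsafe default loader.
    with open(os.path.expanduser(filename)) as f:
        return yaml.safe_load(f)[yaml_key]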

doesn't work for python 2.7

I was trying to use searchtweets with Python 2.7; just importing searchtweets returns something like:

return {**dict1, **dict2}
^
SyntaxError: invalid syntax

I switched to Python 3.6 and it works fine.
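For reference, the {**dict1, **dict2} syntax is Python 3.5+ only; a Python 2-compatible equivalent (shown just to illustrate what a 2.7 port would have to change) is:

def merge_dicts(dict1, dict2):
    # Python 2-compatible equivalent of {**dict1, **dict2}:
    # later keys win, and neither input is modified.
    merged = dict(dict1)
    merged.update(dict2)
    return merged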

ResultStream only returns first page ?

I use a Full Archive premium account to search tweets from 2016. However, instead of receiving a list of distinct tweets, I get a list that contains the same tweets repeated n times. It may be a pagination problem, although the number of unique tweets is 942, so higher than 500. Here is my code:

from searchtweets import ResultStream, gen_rule_payload, load_credentials

premium_search_args = load_credentials(filename="./search_tweets_creds.yaml",
                 yaml_key="search_tweets_api",
                 env_overwrite=False)
rule = gen_rule_payload("(orlando shooting OR pulse) profile_country:US",
                        from_date="2016-06-14 15:00",
                        to_date="2016-06-14 16:00",
                        results_per_call=500)
rs = ResultStream(rule_payload=rule,
                  max_results=11800,
                  max_pages=24,
                  **premium_search_args)
tweets = list(rs.stream())

As a result, I get a list of 11800 tweets that looks like this (with example ids):

id | created_at
1 | Tue Jun 14 15:59:59 +0000 2016
2 | Tue Jun 14 15:59:58 +0000 2016
...
473 | Tue Jun 14 15:57:31 +0000 2016
1 | Tue Jun 14 15:59:59 +0000 2016
2 | Tue Jun 14 15:59:58 +0000 2016
...
844 | Tue Jun 14 15:54:53 +0000 2016
1 | Tue Jun 14 15:59:59 +0000 2016
...

I used the same code last month without a problem; was there any recent change in the API that could cause this?

[QUESTION] How to make a rule to avoid retweets?

Hello, I'm collecting tweets this way:

rule = gen_rule_payload("rihana", results_per_call=100)
tweets = collect_results(rule, max_results=100, result_stream_args=cred)

But I'm not interested in retweets, so I'm wasting requests by filtering them out after I receive them.

I genuinely tried to understand the docs before asking, but I can't work out how PowerTrack rules really work, and whether they have anything to do with this.
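If the is:retweet operator is available on your tier (availability varies by product), negating it in the rule keeps retweets from being returned at all; a minimal sketch reusing the names above:

from searchtweets import gen_rule_payload, collect_results, load_credentials

cred = load_credentials()

# "-is:retweet" excludes retweets server-side, so no requests are spent on them.
rule = gen_rule_payload("rihana -is:retweet", results_per_call=100)
tweets = collect_results(rule, max_results=100, result_stream_args=cred)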

ResultStream raises a generic HTTPError

After reviewing result_stream.py I noticed that the retry decorator raises an HTTPError with no context surrounding the actual HTTP code encountered. See below:
def retry(func):
    """
    Decorator to handle API retries and exceptions. Defaults to three retries.

    Args:
        func (function): function for decoration

    Returns:
        decorated function
    """
    def retried_func(*args, **kwargs):
        max_tries = 3
        tries = 0
        while True:
            try:
                resp = func(*args, **kwargs)

            except requests.exceptions.ConnectionError as exc:
                exc.msg = "Connection error for session; exiting"
                raise exc

            except requests.exceptions.HTTPError as exc:
                exc.msg = "HTTP error for session; exiting"
                raise exc

            if resp.status_code != 200 and tries < max_tries:
                logger.warning("retrying request; current status code: {}"
                               .format(resp.status_code))
                tries += 1
                # mini exponential backoff here.
                time.sleep(tries ** 2)
                continue

            break

        if resp.status_code != 200:
            error_message = resp.json()["error"]["message"]
            logger.error("HTTP Error code: {}: {}".format(resp.status_code, error_message))
            logger.error("Rule payload: {}".format(kwargs["rule_payload"]))
            raise requests.exceptions.HTTPError

        return resp

    return retried_func
The Retry section of this page https://developer.twitter.com/en/docs/tweets/search/overview/enterprise.html recommends throttling requests when a 503 is encountered. I'd also like to be able to respond to and log other HTTP error codes. Are there any plans to address this, or can I open a PR to do so?
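One possible shape for the change being proposed here, raising the error with the response attached so callers can inspect the status code and back off on a 503; a sketch only, not an agreed-upon design:

import requests

def raise_with_context(resp):
    # Include the status code in the message and attach the response so the
    # caller can branch on resp.status_code (e.g. throttle harder on 503).
    msg = "HTTP error {} for session".format(resp.status_code)
    raise requests.exceptions.HTTPError(msg, response=resp)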

URL-encoded special character ($) returns 422 no viable character

Running:
search_tweets.py --credential-file .twitter_keys.yaml --max-results 100 --results-per-call 100 --filter-rule "%24tsla" --filename-prefix tsla_test --print-stream

Returns:
ERROR:searchtweets.result_stream:HTTP Error code: 422: There were errors processing your request: no viable alternative at character '%' (at position 1)
ERROR:searchtweets.result_stream:Rule payload: {'query': '%24tsla', 'maxResults': 100}

Per the Standard Operator docs, I would expect this call to succeed. FWIW, the call succeeds with tweepy and the same query string.
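The error suggests the rule is passed through verbatim, so the percent-encoding itself ends up in the query. If your (non-sandbox) premium tier supports the cashtag operator, passing the literal rule may work; a speculative Python sketch:

from searchtweets import gen_rule_payload, collect_results, load_credentials

search_args = load_credentials(filename=".twitter_keys.yaml")

# Pass the cashtag rule un-encoded; the library and requests handle any encoding needed.
rule = gen_rule_payload("$tsla", results_per_call=100)
tweets = collect_results(rule, max_results=100, result_stream_args=search_args)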

ResultStream stream() function does not paginate

I'm using a ResultStream object with search args from a ./.twitter_keys.yaml config file that points at a premium Full Archive Search (FAS) endpoint. Our subscription is for the maximum premium monthly rate limit of 2,500 requests.

My YAML file:

search_tweets_premium:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/fullarchive/:my_dev_env.json
  consumer_key: <my_key>
  consumer_secret: <my_secret>

My code (which I'm running from a Jupyter notebook):

from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

premium_search_args = load_credentials(filename="./.twitter_keys.yaml",
                 yaml_key="search_tweets_premium",
                 env_overwrite=False)

rule = gen_rule_payload("from:karenm_2000", results_per_call=500)
print(rule)

rs = ResultStream(rule_payload=rule,
                  max_results=500,
                  max_pages=100,
                  **premium_search_args)

tweets = list(rs.stream())

outputs...

{"query": "from:karenm_2000", "maxResults": 500}
searchtweets.result_stream - INFO - using bearer token for authentication
searchtweets.result_stream - INFO - ending stream at 38 tweets

After a careful manual review of @karenm_2000's timeline, 38 is the number of tweets this user has published in the past 30 days (at the time of writing). For whatever reason, ResultStream stops after 30 days of results instead of paginating back through the full archive until it hits one of the limiting args.

outputs 500 results despite changing config

Hi! I have set max-results in my config file to something other than 500, but the output always contains 500 results no matter how I change the config file, even when max-results is smaller than 500. I wonder if I have done something wrong here.

renaming package for pypi

I am renaming the package so we can distribute this on PyPI. The new name is searchtweets, and we should rename the repo as well.

I propose twitterdev/search-tweets-python as the new repo name, after @jimmoffitt's Ruby search API wrapper. Thoughts from @twitterdev/des-science?
