Comments (9)

Sumit2889 commented on August 15, 2024

@BODF
The problem is in the encoding step; the code works correctly after a few changes. The working version is below.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from geopy.geocoders import Nominatim
geolocator = Nominatim()
from geopy.exc import GeocoderTimedOut
URL_INIT = 'https://twitter.com/'

#list_of_users --> From all scraped tweets, get a set of unique usernames and put them in this list.


#The located userlocations are appended to this list
list_of_found_userlocations = []

#The not located userlocations are appended to this list.
#Maybe they contain some typo or something else like that.
list_of_nonfound_userlocations = []

def parse_url(tweet_user):
    url = URL_INIT+ tweet_user.strip('@')
    return url

for user in list_of_users:
    try:
        url = parse_url(user)
        response = urlopen(url)
    except:
        continue
    html = response.read()
    soup = BeautifulSoup(html, features="lxml")
    location = soup.find('span','ProfileHeaderCard-locationText').text.strip('\n').strip()
    if location:
        if ',' in location:
            splitted_location = location.split(',')
        else:
            # Split on any single separator character: | ; - / ° #
            splitted_location = re.split(r'[|;\-/°#]', location)
        try:
            if splitted_location:
                located_location = geolocator.geocode(splitted_location[0], timeout=100)
            else:
                located_location = geolocator.geocode(location, timeout=100)
            if located_location:
                user_plus_location = (user, located_location)
                list_of_found_userlocations.append(user_plus_location)
            else:
                user_plus_incorrect_location = (user, location)
                list_of_nonfound_userlocations.append(user_plus_incorrect_location)
        except GeocoderTimedOut as e:
            print("Error: geocode failed on input %s with message %s"%(location, e))

print(list_of_found_userlocations)
print(list_of_nonfound_userlocations)
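
The script assumes that list_of_users already exists. A minimal sketch of building it, assuming each scraped tweet is available as a dict with a "username" key (that key name and the helper name are placeholders, not something confirmed in this thread, so adapt them to however your tweets are stored):

```python
def unique_usernames(tweets, key="username"):
    """Return the distinct usernames in first-seen order.

    `tweets` is assumed to be an iterable of dicts; `key` is a
    hypothetical field name holding the account handle.
    """
    seen = set()
    users = []
    for tweet in tweets:
        name = (tweet.get(key) or "").strip().lstrip("@")
        # Twitter handles are case-insensitive, so compare in lower case.
        if name and name.lower() not in seen:
            seen.add(name.lower())
            users.append(name)
    return users


sample_tweets = [
    {"username": "@alice", "text": "hello"},
    {"username": "bob", "text": "hi"},
    {"username": "Alice", "text": "again"},
    {"text": "no username field"},
]
list_of_users = unique_usernames(sample_tweets)
print(list_of_users)  # ['alice', 'bob']
```

Deduplicating first matters because every username costs one page request and possibly one geocoding request.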

from twitterscraper.

taspinar commented on August 15, 2024

A while back I wrote some code to scrape the location of users.
With it you should be able to scrape the location of most users.
Of course the location is not real-time.

In my experience this is sufficient, because most people fill in their location correctly, and if a nonexistent city or country is filled in, it is not scraped.

If I have more time in the future I will add this as a feature, but for now I think this should be sufficient.

For this you first need to install the Python package geopy. beautifulsoup4 should already be installed since you have twitterscraper.

pip install geopy

Then you can run the following script:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from geopy.geocoders import Nominatim
geolocator = Nominatim()
from geopy.exc import GeocoderTimedOut
URL_INIT = 'https://twitter.com/'

#list_of_users --> From all scraped tweets, get a set of unique usernames and put them in this list. 

#The located userlocations are appended to this list
list_of_found_userlocations = []

#The not located userlocations are appended to this list. 
#Maybe they contain some typo or something else like that. 
list_of_nonfound_userlocations = []

def parse_url(tweet_user):
    url = URL_INIT+ tweet_user.strip('@')
    return url

for user in list_of_users:
    try:
        url = parse_url(user)
        response = urlopen(url)
    except:
        continue
    html = response.read()
    soup = BeautifulSoup(html)
    location = soup.find('span','ProfileHeaderCard-locationText').text.encode('utf8').strip('\n').strip()
    if location:
        if ',' in location:
            splitted_location = location.split(',')
        else:
            # Split on any single separator character: | ; - / ° #
            splitted_location = re.split(r'[|;\-/°#]', location)
        try:
            if splitted_location:
                located_location = geolocator.geocode(splitted_location[0], timeout=100)
            else:
                located_location = geolocator.geocode(location, timeout=100)
            if located_location:
                user_plus_location = (user, located_location)
                list_of_found_userlocations.append(user_plus_location)
            else:
                user_plus_incorrect_location = (user, location)
                list_of_nonfound_userlocations.append(user_plus_incorrect_location)
        except GeocoderTimedOut as e:
            print("Error: geocode failed on input %s with message %s"%(location, e))
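
One thing to watch in the splitting step: a regex with an empty alternative (for example one that begins with `|`) can match the empty string, and since Python 3.7 re.split also splits on empty matches, so the first element of the result can end up empty. A character class avoids that. The helper below is only a sketch of the idea; the function name and the exact separator set are my own assumptions, not part of the script above.

```python
import re

# Separators people commonly put between parts of a profile location.
# This set is an assumption; extend it as needed.
_SEPARATORS = re.compile(r"[,|;/°#\-]")


def first_location_part(location):
    """Return the first non-empty chunk of a free-text location string."""
    for part in _SEPARATORS.split(location):
        part = part.strip()
        if part:
            return part
    return ""


print(first_location_part("Amsterdam, The Netherlands"))  # Amsterdam
print(first_location_part("  | Berlin / Germany"))        # Berlin
print(repr(first_location_part("###")))                   # ''
```

Skipping empty chunks means a location that starts with a separator still yields something the geocoder can use.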

ujjwalll commented on August 15, 2024

Hi, the above code gives me the error "Required bytes not str" on the location = soup.find.... line. Please help me out.

BODF commented on August 15, 2024

> Hi, the above code gives me the error "Required bytes not str" on the location = soup.find.... line. Please help me out.

@ujjwalll This is a problem I also ran into. I think some recent version changes may have broken the above code. My workaround is below. The only change is how BeautifulSoup finds the relevant element in the Twitter HTML. My code is slow: it took 45 minutes to run through 1000 usernames and found only 31 with locations. If you see a way to speed it up, please post.

for user in list_of_users:
    try:
        url = parse_url(user)
        response = urlopen(url)
    except:
        continue
    html = response.read()
    soup = BeautifulSoup(html)
    location = False
    a = soup.find('a', {"data-place-id": True})
    if a: # Does the location info exist?
        location = a.string
    if location:
        if ',' in location:
            splitted_location = location.split(',')
        else:
            # Split on any single separator character: | ; - / ° #
            splitted_location = re.split(r'[|;\-/°#]', location)
        try:
            if splitted_location:
                located_location = geolocator.geocode(splitted_location[0], timeout=100)
            else:
                located_location = geolocator.geocode(location, timeout=100)
            if located_location:
                user_plus_location = (user, located_location)
                list_of_found_userlocations.append(user_plus_location)
            else:
                user_plus_incorrect_location = (user, location)
                list_of_nonfound_userlocations.append(user_plus_incorrect_location)
        except GeocoderTimedOut as e:
            print("Error: geocode failed on input %s with message %s"%(location, e))
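
On the speed question, one cheap option is to cache geocoding results so that a location string that appears for many users is only looked up once. This is a sketch with a stand-in geocoder rather than a measured fix; make_cached_geocoder and fake_geocode are hypothetical names, and it only helps when location strings actually repeat. Much of the runtime may simply be the per-user page fetch and the geocoding service's own rate limits.

```python
def make_cached_geocoder(geocode):
    """Wrap `geocode` so each distinct location string is looked up once."""
    cache = {}

    def cached(location):
        # Normalize whitespace and case so "Paris", " paris " and "PARIS"
        # share one cache entry.
        key = " ".join(location.split()).casefold()
        if key not in cache:
            cache[key] = geocode(location.strip())
        return cache[key]

    return cached


# Stand-in for the real geocoder, used only to show the caching effect.
calls = []


def fake_geocode(query):
    calls.append(query)
    return {"query": query}


cached_geocode = make_cached_geocoder(fake_geocode)
for loc in ["Paris", " paris ", "PARIS", "Berlin"]:
    cached_geocode(loc)

print(len(calls))  # 2
```

In the loop above you would wrap the real call instead, for example make_cached_geocoder(lambda q: geolocator.geocode(q, timeout=100)), and call the wrapper wherever geolocator.geocode is called now.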

SurajMeena commented on August 15, 2024

@Sumit2889 I am running your code on around 100 usernames and the location lists I get back are empty. What could be the reason for this? Is the script not working properly?
Also, in some cases the username contains emojis or is in a different language. Does this script handle such usernames?

BODF commented on August 15, 2024

> @Sumit2889 I am running your code on around 100 usernames and the location lists I get back are empty. What could be the reason for this? Is the script not working properly?

@SurajMeena Can you describe what you've done to investigate the problem already? Have you tested a few lines of your input and gone line by line to make sure it gives the expected output? Are there any errors or warnings getting thrown?

> Also, in some cases the username contains emojis or is in a different language. Does this script handle such usernames?

I haven't observed problems with emojis or language getting parsed.

SurajMeena commented on August 15, 2024

No, there aren't any errors or exceptions thrown; it's just that list_of_found_userlocations and list_of_nonfound_userlocations turn out empty. At first I thought this was because of emojis in usernames, so I removed those usernames, but the returned lists are still empty.
There is, however, a warning when importing the libraries: DeprecationWarning: Using Nominatim with the default "geopy/1.21.0" user_agent is strongly discouraged, as it violates Nominatim's ToS https://operations.osmfoundation.org/policies/nominatim/ and may possibly cause 403 and 429 HTTP errors. Please specify a custom user_agent with Nominatim(user_agent="my-application") or by overriding the default user_agent: geopy.geocoders.options.default_user_agent = "my-application". In geopy 2.0 this will become an exception.

BODF commented on August 15, 2024

@SurajMeena
The Nominatim warning may be worth addressing. I checked my code and I explicitly declare a user agent. I've edited and pasted the code fragment below that lets you do this. It's a simple change: note the user_agent declaration and change it to whatever you'd like.

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="SurajMeena") # needs a useragent

SurajMeena commented on August 15, 2024

Still not working; the lists are still empty. I managed to get user locations using tweepy, so my task is done.
