
Comments (16)

tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Sounds good!

On Tue, Mar 29, 2016 at 5:22 AM, Emil Kirkegaard [email protected] wrote:

Currently, the scraper picks users at semi-random and scrapes them, then
picks some more, and so on. Although there are hundreds of thousands of users,
doing it this way makes it possible for the same user to be scraped twice,
which potentially wastes the scraper's time.

To avoid this problem, one can create a user file that keeps track of which
users were scraped, how many questions they had answered, and when they were
scraped. This should be simple enough, so I will try.



Deleetdk avatar Deleetdk commented on September 4, 2024

I have implemented saving the user info to users.csv in 0450111

It works on my end, but I see that some settings were changed incorrectly.
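
For illustration, a row in such a users.csv might look something like this (the column names and values are only an assumption about the layout, not taken from the actual file):

username,questions_answered,scraped_at
example_user,250,2016-03-29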


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

So you want me to remove my changes to .gitignore and settings.py?


Deleetdk avatar Deleetdk commented on September 4, 2024

Just use git pull before you use git commit. This updates your local version with the server's, so that there are no conflicts. Make sure that your .gitignore file is correct because otherwise you are uploading temporary files (those ending with ~) and data files (those in data/) to the repository. If you look at https://github.com/Deleetdk/OKCubot2/blob/master/.gitignore#L64 you will see that I have excluded these.
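
For reference, the relevant exclusions boil down to patterns roughly like these (an illustrative sketch, not a copy of the repository's actual .gitignore):

# editor backup files ending with ~
*~
# scraped data files
data/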


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I pushed a new version.

Looking forward to your response.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I pulled your original project,
so this issue should be solved.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I think this issue is already solved.


Deleetdk avatar Deleetdk commented on September 4, 2024

We need to test it. I tried testing it with the --u option. However, it still scrapes the user twice.


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Twice?

It scrapes the user only once.


Deleetdk avatar Deleetdk commented on September 4, 2024

If I run the scraper twice with the same user in --u, the profile is scraped twice. It should be skipped the second time if the number of questions answered in the profile is the same as the number in users.csv. Make sure that this skipping feature can be disabled with a command-line argument, e.g. --noskip.
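
In Python terms, the requested behaviour is roughly the following sketch (assuming the saved counts from users.csv are available as a dict; none of these names are the actual okcubot2 code):

# Decide whether a profile can be skipped on a repeat run.
def should_skip(username, answered_on_profile, saved_counts, noskip=False):
    if noskip:
        # --noskip turns the skipping feature off entirely
        return False
    previous = saved_counts.get(username)
    # Skip only if the user was scraped before and the answered count is unchanged.
    return previous is not None and previous == answered_on_profile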


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

Hi, Emil.

I pushed a new version.
This issue is solved in that version.


Deleetdk avatar Deleetdk commented on September 4, 2024

I tested this with user mama_crossasaur. This user has not answered any hidden questions. The code correctly skips her. I also tried the --noskip argument; that worked as well.

python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur
python start.py PlainSeagull PlainSeagullPlainSeagull --u mama_crossasaur --noskip

There is a problem with users who have answered some questions privately. The scraper cannot scrape those answers, so they are not included in the saved count, while the comparison uses the number shown in the profile; the two figures never match, so such users are re-scraped every time. For example, if a profile shows 300 answered questions and 20 of them are private, only 280 are scraped and saved, and 280 never equals 300.

The solution is to use the number shown in the profile on both sides of the comparison. I will make this change myself.


Deleetdk avatar Deleetdk commented on September 4, 2024

Can you add the number of questions answered to target_info? Call it m_numberanswered. That way I can get the information I need in the save_as_csv function.
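
As a sketch, the request amounts to something like this (target_info, m_numberanswered and save_as_csv are the names from the discussion; the dict literal and example value below are placeholders, not the actual scraper code):

# While scraping a profile, record the answered count shown on the profile page.
target_info = {}
target_info['m_numberanswered'] = 250  # hypothetical value

# Inside save_as_csv, the value can then be read back, e.g.:
number_answered = target_info.get('m_numberanswered', 0)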


tomwalter2287 avatar tomwalter2287 commented on September 4, 2024

I sent you a new version.


Deleetdk avatar Deleetdk commented on September 4, 2024

Good. I am currently doing a large test of the scraper (scraping 1000 users). I am looking to see if there are more bugs that we just haven't seen yet.

I will try your new version after I'm done with that.


Deleetdk avatar Deleetdk commented on September 4, 2024

Fixed in 4b9a67a
