Code Monkey home page Code Monkey logo

Comments (14)

edsu avatar edsu commented on August 17, 2024 1

If you install v0.0.24 you will get additional information like this:

2017-01-16 13:59:36,353 - root - INFO - shutting down: new=2 checked=4 skipped=10 elapsed=0:00:01.073984

from diffengine.

edsu avatar edsu commented on August 17, 2024 1

@ruebot nice thing about using the flock is that you can start it up at lower intervals, every 15 minutes or even every 5 minutes...

from diffengine.

ryanfb avatar ryanfb commented on August 17, 2024 1

After thinking about this a bit, I wonder if a few simple changes might help:

  1. Make sure we iterate over the entries list in creation order (this might already be happening, but just looking at the code I'm not sure https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L67). This way we're always processing the newest entries first instead of potentially spending a bunch of time doing old entries first - which if you're wrapping your diffengine in a timeout, might starve new entries.
  2. Increase the default threshold. I could see something in the range of .5 (day-old gets checked every half day, week-old gets checked every few days, month-old gets checked every couple weeks) to 1.0 (very simple to understand the age/checking relationship - we could also set an upper bound on the value of hotness to e.g. a month so that we're never waiting more than a month between checks).
  3. In conjunction with 2, reset the creation time if an entry is modified (or add a new modified time attribute and use that to calculate the hotness value). This would allow us to have a higher threshold and still catch the cases where e.g. an article is getting updated frequently over the course of a week, or an old article gets a flurry of updates. This could potentially result in some thrashing when a site makes a global change that causes all their entries to show up as modified after content normalization.

from diffengine.

gregjan avatar gregjan commented on August 17, 2024

I don't have insight into the stats myself. Do these servers respond to HEAD requests accurately? That should be faster than a diff..

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

Some data points from my setup.

CBC - 42 feeds

2017-01-14 10:12:20,055 - root - INFO - shutting down: 0:51:27.529976
2017-01-14 13:32:35,568 - root - INFO - shutting down: 1:02:34.198437
2017-01-14 15:24:19,525 - root - INFO - shutting down: 0:54:17.493559
2017-01-14 17:09:52,855 - root - INFO - shutting down: 0:54:50.911007
2017-01-14 18:11:45,281 - root - INFO - shutting down: 0:56:43.139898
2017-01-14 19:09:45,958 - root - INFO - shutting down: 0:54:44.392014
2017-01-14 20:09:54,507 - root - INFO - shutting down: 0:54:53.252183
2017-01-14 21:11:51,047 - root - INFO - shutting down: 0:56:49.261002
2017-01-14 22:13:16,338 - root - INFO - shutting down: 0:58:15.006552
2017-01-14 23:11:30,542 - root - INFO - shutting down: 0:56:28.920192
2017-01-15 00:12:25,490 - root - INFO - shutting down: 0:57:23.657105
2017-01-15 01:11:45,595 - root - INFO - shutting down: 0:56:43.754114
2017-01-15 02:06:40,166 - root - INFO - shutting down: 0:51:38.215280
2017-01-15 03:05:54,218 - root - INFO - shutting down: 0:50:52.668085
2017-01-15 04:04:57,935 - root - INFO - shutting down: 0:49:56.339932
2017-01-15 05:05:25,381 - root - INFO - shutting down: 0:50:23.064285
2017-01-15 06:06:06,482 - root - INFO - shutting down: 0:51:04.715768
2017-01-15 07:06:25,907 - root - INFO - shutting down: 0:51:24.007759
2017-01-15 08:10:25,069 - root - INFO - shutting down: 0:55:22.783080
2017-01-15 09:15:26,977 - root - INFO - shutting down: 1:00:25.543582

The Globe & Mail - 174 feeds

2017-01-14 09:47:27,868 - root - INFO - shutting down: 0:17:25.983662
2017-01-14 10:17:42,696 - root - INFO - shutting down: 0:17:40.530126
2017-01-14 10:47:57,171 - root - INFO - shutting down: 0:17:55.090977
2017-01-14 11:17:06,091 - root - INFO - shutting down: 0:17:04.567858
2017-01-14 11:47:55,512 - root - INFO - shutting down: 0:17:54.131806
2017-01-14 12:18:14,028 - root - INFO - shutting down: 0:18:12.170724
2017-01-14 12:48:34,178 - root - INFO - shutting down: 0:18:32.800043
2017-01-14 13:17:40,891 - root - INFO - shutting down: 0:17:39.432581
2017-01-14 19:50:17,275 - root - INFO - shutting down: 5:20:15.156602
2017-01-14 23:24:06,302 - root - INFO - shutting down: 2:49:04.247076
2017-01-15 02:32:11,626 - root - INFO - shutting down: 2:57:09.492647
2017-01-15 05:16:41,393 - root - INFO - shutting down: 2:41:39.316417
2017-01-15 08:28:26,709 - root - INFO - shutting down: 2:53:25.176995

La Presse - 272 feeds

2017-01-14 21:05:31,423 - root - INFO - shutting down: 0:00:28.450089
2017-01-14 21:20:20,473 - root - INFO - shutting down: 0:00:19.141869
2017-01-14 23:25:41,218 - root - INFO - shutting down: 0:07:51.681414
2017-01-14 23:52:57,912 - root - INFO - shutting down: 0:08:00.091925
2017-01-15 02:22:17,085 - root - INFO - shutting down: 2:02:15.088546
2017-01-15 04:21:39,277 - root - INFO - shutting down: 1:01:37.524389
2017-01-15 06:21:21,182 - root - INFO - shutting down: 1:01:19.278733
2017-01-15 08:21:24,028 - root - INFO - shutting down: 1:01:21.583565

Toronto Sun - 61 feeds

2017-01-14 12:01:08,299 - root - INFO - shutting down: 0:01:06.367782
2017-01-14 12:04:57,990 - root - INFO - shutting down: 0:01:24.518744
2017-01-14 12:06:05,057 - root - INFO - shutting down: 0:00:30.213326
2017-01-14 14:32:02,813 - root - INFO - shutting down: 1:02:00.581537
2017-01-14 15:51:12,505 - root - INFO - shutting down: 0:41:10.570579
2017-01-14 16:49:51,841 - root - INFO - shutting down: 0:39:49.905569
2017-01-14 17:49:13,600 - root - INFO - shutting down: 0:39:11.447067
2017-01-14 18:50:00,379 - root - INFO - shutting down: 0:39:58.324235
2017-01-14 19:50:15,777 - root - INFO - shutting down: 0:40:14.511174
2017-01-14 20:52:23,072 - root - INFO - shutting down: 0:42:21.243300
2017-01-14 21:52:13,635 - root - INFO - shutting down: 0:42:11.479865
2017-01-14 22:50:59,846 - root - INFO - shutting down: 0:40:58.384031
2017-01-14 23:51:05,598 - root - INFO - shutting down: 0:41:03.421256
2017-01-15 00:51:21,725 - root - INFO - shutting down: 0:41:19.825083
2017-01-15 01:52:39,583 - root - INFO - shutting down: 0:42:37.540521
2017-01-15 02:50:35,141 - root - INFO - shutting down: 0:40:33.583783
2017-01-15 03:52:36,677 - root - INFO - shutting down: 0:42:35.075756
2017-01-15 04:51:29,327 - root - INFO - shutting down: 0:41:26.991347
2017-01-15 05:50:03,298 - root - INFO - shutting down: 0:40:01.526480
2017-01-15 06:52:04,578 - root - INFO - shutting down: 0:42:02.599009
2017-01-15 07:51:25,814 - root - INFO - shutting down: 0:41:23.516856
2017-01-15 08:52:46,569 - root - INFO - shutting down: 0:42:45.185334

Canadaland - 4 feeds

2017-01-14 21:24:59,903 - root - INFO - shutting down: 0:24:57.804095
2017-01-14 21:55:25,131 - root - INFO - shutting down: 0:25:22.970673
2017-01-14 22:24:50,952 - root - INFO - shutting down: 0:24:49.555501
2017-01-14 22:55:09,853 - root - INFO - shutting down: 0:25:08.089897
2017-01-14 23:25:11,017 - root - INFO - shutting down: 0:25:08.898734
2017-01-14 23:54:59,516 - root - INFO - shutting down: 0:24:57.475271
2017-01-15 00:25:12,466 - root - INFO - shutting down: 0:25:10.616999
2017-01-15 00:54:57,652 - root - INFO - shutting down: 0:24:55.844422
2017-01-15 01:25:52,581 - root - INFO - shutting down: 0:25:50.593156
2017-01-15 01:54:51,265 - root - INFO - shutting down: 0:24:49.347676
2017-01-15 02:25:35,087 - root - INFO - shutting down: 0:25:33.488282
2017-01-15 02:55:05,424 - root - INFO - shutting down: 0:25:03.990036
2017-01-15 03:24:45,634 - root - INFO - shutting down: 0:24:43.871369
2017-01-15 03:54:36,158 - root - INFO - shutting down: 0:24:34.183754
2017-01-15 04:24:56,073 - root - INFO - shutting down: 0:24:54.569207
2017-01-15 04:54:58,330 - root - INFO - shutting down: 0:24:56.910666
2017-01-15 05:24:24,081 - root - INFO - shutting down: 0:24:22.412581
2017-01-15 05:54:26,942 - root - INFO - shutting down: 0:24:25.517462
2017-01-15 06:24:55,516 - root - INFO - shutting down: 0:24:53.228180
2017-01-15 06:54:40,924 - root - INFO - shutting down: 0:24:39.055811
2017-01-15 07:25:02,262 - root - INFO - shutting down: 0:25:00.001297
2017-01-15 07:54:44,987 - root - INFO - shutting down: 0:24:43.380605
2017-01-15 08:24:53,357 - root - INFO - shutting down: 0:24:52.005523
2017-01-15 08:55:06,777 - root - INFO - shutting down: 0:25:04.727537
2017-01-15 09:24:25,455 - root - INFO - shutting down: 0:24:23.347182

from diffengine.

edsu avatar edsu commented on August 17, 2024

@ruebot Wow, Toronto Sun has 61 feeds? Did you mean entries? I'm adding the number of entries checked during the run, and the number skipped to the log output to help.

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

@edsu yep. La Presse has 200+, and some 404... which is fun.

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

updating now

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

Oh, merge conflicts. I'll update #11.

from diffengine.

edsu avatar edsu commented on August 17, 2024

I had a runtime error in there so you'll need to install a new version. FWIW I'm developing the code separately from the installed version (installed with pip).

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

Updated stats:

  • Toronto Sun
2017-01-16 10:10:11,689 - root - INFO - shutting down: new=0 checked=4 skipped=91 elapsed=0:00:09.998173
  • La Presse
root - INFO - shutting down: new=0 checked=8 skipped=17 elapsed=0:00:12.566924
  • Canadaland
2017-01-16 09:30:03,538 - root - INFO - shutting down: new=0 checked=0 skipped=59 elapsed=0:00:01.263901
2017-01-16 10:00:03,233 - root - INFO - shutting down: new=0 checked=0 skipped=59 elapsed=0:00:01.347719
  • CBC
2017-01-16 10:02:09,287 - root - INFO - shutting down: new=0 checked=78 skipped=105 elapsed=0:02:07.376921
  • The Globe & Mail
2017-01-16 10:22:17,488 - root - INFO - shutting down: new=0 checked=107 skipped=52 elapsed=0:04:03.235425

from diffengine.

edsu avatar edsu commented on August 17, 2024

Cool, it'll be interesting to watch these over time. What is your cron schedule like, and are you using the flock trick to prevent concurrent jobs--until diffengine locks for you?

from diffengine.

ruebot avatar ruebot commented on August 17, 2024

I had:

  • Canadaland; every 30 mintues
  • Toronto Sun; every hour
  • La Press; every 2 hours
  • CBC; every 2 hours
  • The Globe & Mails; every 3 hours

Now I'm giving this a shot again:

0,30 * * * * /usr/bin/flock -xn /tmp/canadaland.lock -c "/usr/local/bin/diffengine /home/nruest/.canadaland"
#0,30 * * * * /usr/local/bin/diffengine /home/nruest/.diffengine
0,30 * * * * /usr/bin/flock -xn /tmp/globemail.lock -c "/usr/local/bin/diffengine /home/nruest/.globemail"
0,30 * * * * /usr/bin/flock -xn /tmp/torontosun.lock -c "/usr/local/bin/diffengine /home/nruest/.torontosun"
0,30 * * * * /usr/bin/flock -xn /tmp/cbc.lock -c "/usr/local/bin/diffengine /home/nruest/.cbc"
0,30 * * * * /usr/bin/flock -xn /tmp/lapresse.lock -c "/usr/local/bin/diffengine /home/nruest/.lapresse"

from diffengine.

edsu avatar edsu commented on August 17, 2024

@gregjan I guess feed/entry fetching could use HTTP headers to avoid getting the full representation. But I still wouldn't want to do thousands of HTTP HEAD requests every time it runs. So I think some kind of logic for backing off would still be desirable?

Currently I have a time.sleep embedded that prevents diffengine from doing more than 1 request per second. I wanted to do this to be nice to the server...and also not get blocked. So that's the main thing slowing things down at the moment.

If multiple sites are being monitored I guess diffengine could stripe the requests across them, rather than working on a feed at a time...

from diffengine.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.