Comments (6)
Ah, okay, I think I see what's going on. This looks like a mining issue that stems from how language mining ranges are created, split, and iterated. If a repository is updated very frequently, as PyTorch is, it is constantly moved to the end of the very search results that we are mining. This, coupled with the precautions we take to avoid over-mining, causes this particular issue, where a very active repository is missed in updates. Thank you for bringing this to my attention; I'll devise a fix.
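To make the failure mode concrete, here is a toy model of it. This is only a sketch under assumptions of my own, not the actual mining code: it assumes the crawler mines consecutive date windows whose upper bound lags behind the current time by a fixed `safety_margin` (standing in for the over-mining precautions), and that a repository is captured only if its last push falls inside the current window. The function name and all parameters are hypothetical.

```python
from datetime import datetime, timedelta


def simulate_crawl(push_interval, safety_margin, crawl_interval, steps):
    """Return True if the repository is ever captured by the toy crawler.

    Assumed model (hypothetical, not the real implementation):
    each crawl mines the window [window_start, now - safety_margin],
    and a repository is visible only under its *latest* push date.
    """
    now = datetime(2023, 1, 1)
    last_push = now
    next_push = now + push_interval
    window_start = now - timedelta(days=1)

    for _ in range(steps):
        now += crawl_interval
        # Apply every push that happened since the previous crawl.
        while next_push <= now:
            last_push = next_push
            next_push += push_interval
        # The upper bound deliberately lags behind the current time.
        window_end = now - safety_margin
        if window_start <= last_push <= window_end:
            return True
        # The next crawl continues where this one stopped.
        window_start = window_end
    return False


margin = timedelta(hours=6)
daily = timedelta(days=1)

# A repository pushed every hour always sits ahead of the window.
very_active = simulate_crawl(timedelta(hours=1), margin, daily, steps=30)

# A repository pushed every ten days is picked up without trouble.
quiet = simulate_crawl(timedelta(days=10), margin, daily, steps=30)
```

In this model `very_active` ends up `False` and `quiet` ends up `True`: the more often a repository is pushed to, the more likely its latest push date is always past the window being mined.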
from ghs.
If I understood correctly, you would like to be able to filter by both the last commit date and the last update date? In its current state, the search only supports the former, but all of the information is included in the export. We can also introduce support for filtering by non-code updates by adding another date range input group. It was initially omitted because the date of the last repository page update is not as interesting to researchers as the date of the last commit. People also tend to confuse the two, so we only offered filtering by the one that was more in demand.
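As a rough illustration of what filtering by both dates could look like, here is a minimal sketch. The `Repo` record, its field names, and the sample entries are all made up for the example and do not reflect the actual schema or export format; each bound is optional, so the existing commit-only filter is just the case where the update bounds are left unset.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Repo:
    # Hypothetical record; field names are assumptions for this example.
    name: str
    last_commit: date
    last_update: date


def in_range(value: date, start: Optional[date], end: Optional[date]) -> bool:
    """Check an inclusive date range where either bound may be omitted."""
    return (start is None or value >= start) and (end is None or value <= end)


def filter_repos(
    repos: List[Repo],
    commit_after: Optional[date] = None,
    commit_before: Optional[date] = None,
    update_after: Optional[date] = None,
    update_before: Optional[date] = None,
) -> List[Repo]:
    """Keep repositories that satisfy both date range groups."""
    return [
        r
        for r in repos
        if in_range(r.last_commit, commit_after, commit_before)
        and in_range(r.last_update, update_after, update_before)
    ]


# Made-up sample data.
repos = [
    Repo("example/active", date(2023, 5, 1), date(2023, 6, 1)),
    Repo("example/stale-code", date(2021, 3, 1), date(2023, 6, 1)),
    Repo("example/dormant", date(2020, 1, 1), date(2020, 2, 1)),
]

by_commit = filter_repos(repos, commit_after=date(2023, 1, 1))
by_update = filter_repos(repos, update_after=date(2023, 1, 1))
```

Here `by_commit` contains only `example/active`, while `by_update` also contains `example/stale-code`, whose page was touched recently even though its code was not. That difference is exactly where the two dates tend to get confused.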
To answer your question regarding updates: we crawl each language by the last repository push date. Operating under the assumption that active projects are regularly pushed to, this implies that new data will eventually be encountered by the crawler if there are pushes to the remote. However, detecting non-code updates (such as changes to issue labels and the repository's list of topics) would most likely entail some form of background job in order for our data to stay fully consistent with GitHub. As of right now, we already make use of two scheduled tasks: one that checks whether repositories in our database still exist on GitHub, and another that computes line information through static analysis of the last commit on the default branch. I'll give this some more thought, but as far as this "data maintenance" task is concerned, it will either replace one of the aforementioned jobs, or all three might be merged into one.
I would like to filter by last commit, but it turns out that I miss quite a few repositories this way. For example, the PyTorch repo is listed as not having been updated for two years, so it does not pass the filter "last commit after 01.01.2023". However, the PyTorch repo is regularly updated, so I expected it to be found among the results.
Thanks a lot! 🚀
I merged the changes just now. They should take effect after some time.