
debian / dcs


Debian Code Search (codesearch.debian.net) is a search engine that searches through all the 130 GB of open source software that is included in Debian. Supports regular expressions!

License: Other

Shell 0.52% Go 73.43% HTML 5.65% Makefile 0.19% CSS 4.73% JavaScript 5.78% C 9.44% Kaitai Struct 0.13% Dockerfile 0.13%
debian grep linux open-source search search-engine source source-code

dcs's Introduction

README

Debian packages, maintains and distributes many projects developed using GitHub. This account was created to facilitate push/pull interactions with the upstream developers of such projects. If you maintain a package whose upstream developers use GitHub, please feel free to join this group and mirror such project here.

This account is not intended to serve as the canonical (specified with Vcs-* fields of debian/control) location for corresponding Debian source packages. Most often such repositories should be made available on the Debian project's public forge Salsa to guarantee autonomy.

How to join

Open a new issue with a signed statement asking to be added to the organization. The signature needs to be made with your PGP key currently in the Debian keyring. All active Debian Developers will be approved.

Tips

You might find the following tools, available from Debian, useful for your interaction with GitHub:

github-backup backs up everything GitHub knows about a repository, to the repository

Acknowledgements

Many thanks to the GitHub admins for their prompt action to release the previous (unused) "Debian" account.

Disclaimers

This GitHub organization is not an endorsement of GitHub by Debian. Debian does not maintain or distribute the GitHub engine codebase because it is not available under a free and open-source license (see Wikipedia for a list of available free and open-source alternatives). Moreover, this GitHub organization is not an official part of the Debian project. It is maintained by individual Debian developers (signed below) with the sole purpose of being useful.

-- Charles Plessy, Thu, 14 Jun 2012 09:11:55 +0900

-- Yaroslav Halchenko, Thu, 14 Jun 2012 13:22:03 -0400

dcs's People

Contributors

alexmyczko, breunigs, hitzi, intrigeri, jamessan, jspricke, jwilk, kevinoid, krzsas, l2dy, lamby, maikf, mitya57, nbenitez, pabs3, rgeissert, stapelberg, yath


dcs's Issues

Double spaces in query return no results

The search only returns a result if there is a single space between the package and the string (package:XXX string). When having multiple spaces between the package and the string the search returns no results.
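
One common way to avoid this class of bug is to split the query on runs of whitespace instead of on a single space. The following is a minimal sketch under that assumption; `splitQuery` is a hypothetical helper and not the actual dcs query parser.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuery is a hypothetical helper, not the actual dcs parser. It splits a
// query on runs of whitespace, so repeated, leading, or trailing spaces never
// produce empty terms.
func splitQuery(q string) []string {
	return strings.Fields(q)
}

func main() {
	fmt.Printf("%q\n", splitQuery("package:abook   debian "))
	// Output: ["package:abook" "debian"]
}
```

Because `strings.Fields` also drops trailing whitespace, the same approach would cover a query that ends in a space.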

allow to only search latest source package version

Debian Sid often contains the same source package in multiple versions. This is the current distribution of the number of versions per source package. The first column is the number of source packages and the second column is how many versions these source packages have:

20940: 1
550: 2
31: 3
4: 4
4: 5
2: 7
1: 16

(the source package with 16 different versions is gcc-4.9)

It would be nice if one could limit the results to the latest source package version only. This would be very useful for the 592 source packages that have more than one version in sid.

allow to search for a source package containing a specific filename/path

Hi,

it would be useful if, instead of searching the content of files in source packages, one could search for all source packages that contain a file with a specific name or a certain path. It would then be possible to list source packages that, for example, include a CMakeLists\.txt or a debian/.+\.doc-base.* (notice the usage of a regex).
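
The request could be sketched as follows: match a regular expression against file paths and report each package once. The `files` map, the function name `packagesWithPath`, and the package names are assumptions made for illustration; how dcs actually stores file lists is not stated here.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// packagesWithPath returns the sorted names of all packages that ship at
// least one file whose path matches pattern. The files map (package name to
// file paths) is an assumed input; dcs may store this differently.
func packagesWithPath(files map[string][]string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var matched []string
	for pkg, paths := range files {
		for _, p := range paths {
			if re.MatchString(p) {
				matched = append(matched, pkg)
				break // one matching file is enough for this package
			}
		}
	}
	sort.Strings(matched)
	return matched, nil
}

func main() {
	// Made-up package names and paths, for illustration only.
	files := map[string][]string{
		"pkg-a": {"CMakeLists.txt", "src/main.c"},
		"pkg-b": {"debian/pkg-b.doc-base", "Makefile"},
		"pkg-c": {"README"},
	}
	pkgs, err := packagesWithPath(files, `debian/.+\.doc-base.*`)
	if err != nil {
		panic(err)
	}
	fmt.Println(pkgs) // [pkg-b]
}
```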

Result view looks broken for queries with few results

I tried the query "nfc_initiator_transceive_bytes" today, which only has one page of results. This is a screenshot from the end of the page:
Screenshot
I am fairly certain that it should not look like that.

When I tried "append_dot_mydomain" as a query with probably few results, I saw similar problems.

Include more information with the result JSON URI

It would be nice to get some more information on the result (not only the packages) when querying the JSON file. It'd be ideal if the JSON could optionally include the files that contained the matches as a sub-array per package. To stay compatible, the packages.json part of the generated link on the result page could be extended to allow "files.json", which would provide this additional information.
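
A sketch of what such a response could look like when produced from Go. The type `packageResult` and its field names are assumptions; the actual layout of packages.json is not shown in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// packageResult is a hypothetical shape for one entry of the proposed
// files.json; the field names are assumptions, not the actual dcs format.
type packageResult struct {
	Package string   `json:"package"`
	Files   []string `json:"files,omitempty"` // left out when no files are listed
}

// encodeResults serializes the results as a JSON array.
func encodeResults(results []packageResult) (string, error) {
	b, err := json.Marshal(results)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Made-up package names and paths, for illustration only.
	out, err := encodeResults([]packageResult{
		{Package: "pkg-a", Files: []string{"src/main.c", "src/util.c"}},
		{Package: "pkg-b"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// Output: [{"package":"pkg-a","files":["src/main.c","src/util.c"]},{"package":"pkg-b"}]
}
```

Marking the sub-array as optional keeps entries without file information identical to a packages-only listing.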

analyze logfiles to gather signals?

Not sure if it’s worth it, i.e. if it can influence our ranking enough to have any kind of measurable impact.

Anyhow, it’d be interesting to generate a token on each results page and then include that in /show links. Then, analyze which results are popular/satisfying for a given query.

From my subjective impression, it should be pretty easy to see this in logfiles.

javascript redirection to results does not escape search terms

When submitting a GET request to the /search interface, the initial results look correct. However, the page contains a javascript snippet that redirects to a URI of the form /results/<query>/page_0. This works fine for simple queries, but not if they contain chars that need to be escaped.

For example, a GET request of

https://codesearch.debian.net/search?q=glob+path%3Alib%2Fglob%2Fglob.h

redirects to

https://codesearch.debian.net/results/glob%20path:lib/glob/glob.h/page_0

where it is clear the slashes have not been escaped correctly. The initial results page contains the javascript snippet:

<script type="text/javascript">
<!--
if (location.pathname.substr(0, '/search'.length) === '/search') {
    window.location.replace('/results/glob path:lib\/glob\/glob.h/page_0');
}
-->
</script>
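
If the redirect target is generated on the server side, escaping the query as a single path segment would keep its slashes from being read as path separators. A minimal sketch, assuming a hypothetical helper `resultsPath`; whether the /results handler accepts escaped slashes is not confirmed here.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// resultsPath is a hypothetical helper that builds the redirect target.
// url.PathEscape escapes the query as one path segment, so "/" becomes %2F
// and a space becomes %20.
func resultsPath(query string, page int) string {
	return "/results/" + url.PathEscape(query) + "/page_" + strconv.Itoa(page)
}

func main() {
	fmt.Println(resultsPath("glob path:lib/glob/glob.h", 0))
	// Output: /results/glob%20path:lib%2Fglob%2Fglob.h/page_0
}
```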

investigate current best-practice Go RPC method

Maybe it’d be better to not use HTTP+JSON between the dcs-web/index/source backends. If there is a better way of doing RPCs in Go (last I checked, they wanted to work on something eventually), it is worth checking whether that makes serving results significantly faster.

Migrate monitoring from home-grown solution to prometheus

This issue tracks commits for the migration.

Rough outline:

  1. Add /metrics handlers to every binary which currently provides /varz.
  2. Set up prometheus.
  3. Set up graphs.
  4. Compare graphs to make sure there are no unexpected differences.
  5. Delete old /varz code.
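
Step 1 of the outline could look roughly like the sketch below. It uses only the standard library and writes the Prometheus text exposition format by hand; a real migration would more likely use the official Prometheus Go client library. The metric name `dcs_search_requests_total` and the counter variable are made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync/atomic"
)

// searchRequests is a made-up example counter for this sketch.
var searchRequests int64

// writeMetrics writes the counter in the Prometheus text exposition format.
func writeMetrics(w io.Writer) {
	fmt.Fprintln(w, "# TYPE dcs_search_requests_total counter")
	fmt.Fprintf(w, "dcs_search_requests_total %d\n", atomic.LoadInt64(&searchRequests))
}

// metricsHandler serves the metrics over HTTP, as a /metrics endpoint would.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	writeMetrics(w)
}

// metricsText renders the metrics into a string, which is handy for checks.
func metricsText() string {
	var sb strings.Builder
	writeMetrics(&sb)
	return sb.String()
}

func main() {
	// The handler is registered, but no server is started in this sketch.
	http.HandleFunc("/metrics", metricsHandler)
	atomic.AddInt64(&searchRequests, 3)
	fmt.Print(metricsText())
}
```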

ignore trailing space after package: directive

The following search string works: "debian package:abook" but the following does not: "debian package:abook ". Notice that the latter has a trailing space. This space should be ignored.

Investigate how many packages contain non-UTF-8 and perhaps implement a workaround

There have been at least 3 user reports of packages that ship files that are not encoded in UTF-8:

  • grafx2 2.3
  • lcms2 2.6
  • logidee-tools
  • units-filter 3.7-2 src/unites.[ly]

When building the next index, we should make sure to gather metrics to determine how many packages are affected in total and then decide whether we want to add some sort of workaround or just fix the affected packages upstream.

make it easy to bring up another instance for verifying that a new index is served correctly

This includes:

  • adding a -listen_host option to source-backend
  • possibly unifying index-backend’s -listen to -listen_host
  • creating a little script which generates a new canary config by replacing the port numbers from the live config so that any changes are always taken over without having to care
  • check from the script whether the new site serves (also check before that it doesn’t serve)
  • check the index size and see if it differs by more than x%
  • if everything is okay, swap the indexes
  • in dcsindex and dcsunpack, switch from log to printf to get the output on stdout instead of stderr

This enables automatic updates from crontab :).
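
The index size check from the list above could be sketched like this. The function name and the threshold parameter are assumptions; the issue leaves the value of x open.

```go
package main

import "fmt"

// sizeDiffersTooMuch reports whether newSize deviates from oldSize by more
// than maxPercent percent. A missing or empty old index counts as suspicious.
func sizeDiffersTooMuch(oldSize, newSize int64, maxPercent float64) bool {
	if oldSize <= 0 {
		return true // nothing sensible to compare against
	}
	diff := newSize - oldSize
	if diff < 0 {
		diff = -diff
	}
	return float64(diff)*100 > maxPercent*float64(oldSize)
}

func main() {
	fmt.Println(sizeDiffersTooMuch(1000, 1050, 10)) // false: 5% change
	fmt.Println(sizeDiffersTooMuch(1000, 700, 10))  // true: 30% change
}
```

A canary script would only swap the indexes when this check returns false.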

verify the udd import works correctly

We need to catch the case that the udd popcon_src dump wget runs into a 404 or that the data file is not updated for some reason.

Otherwise, this results in stale ranking data without anyone ever noticing, since this is not a thing typically checked.
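
One possible check, sketched under the assumption that the dump is fetched over HTTP and that the server sends a Last-Modified header: fail on any status other than 200 OK and on a header that is older than a chosen maximum age. The function names and the 48-hour limit in the example are assumptions; `checkFake` only exists to exercise the check against a local test server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkDump fetches url and returns an error if the response is not 200 OK
// or if its Last-Modified header is older than maxAge.
func checkDump(url string, maxAge time.Duration) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected HTTP status %s", resp.Status)
	}
	modified, err := http.ParseTime(resp.Header.Get("Last-Modified"))
	if err != nil {
		return fmt.Errorf("cannot parse Last-Modified: %v", err)
	}
	if age := time.Since(modified); age > maxAge {
		return fmt.Errorf("dump is stale: last modified %v ago", age.Round(time.Hour))
	}
	return nil
}

// checkFake runs checkDump against a local test server that answers with the
// given status and a Last-Modified header lying age in the past.
func checkFake(status int, age, maxAge time.Duration) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Last-Modified", time.Now().Add(-age).UTC().Format(http.TimeFormat))
		w.WriteHeader(status)
	}))
	defer srv.Close()
	return checkDump(srv.URL, maxAge)
}

func main() {
	fmt.Println(checkFake(http.StatusOK, time.Hour, 48*time.Hour))       // fresh dump: <nil>
	fmt.Println(checkFake(http.StatusNotFound, time.Hour, 48*time.Hour)) // 404: error
	fmt.Println(checkFake(http.StatusOK, 72*time.Hour, 48*time.Hour))    // stale dump: error
}
```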

page list at bottom when grouping by source package

(I already reported this to Michael in private mail, but never got a reply. I'm filing it here in the hope that it won't get lost this time.)

I wanted to try out the "Group results by source package" feature.
But unfortunately when it's enabled, the list of pages is available only at the top of the page.
Could you add it also to the bottom of the page, where it would be much more convenient?

dcs-varz-to-influxdb can time out

At some point, I realized that dcs-varz-to-influxdb thought it was still connected to a dcs-web instance that was turned down at that point:

dcs-varz- 6639 root    3w  IPv4            8509323      0t0     TCP 10.208.130.138:54610->10.177.148.19:28080 (ESTABLISHED)

Probably, there are timeouts missing.
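
A minimal sketch of one way to add such a timeout, assuming the tool polls its targets over HTTP (how dcs-varz-to-influxdb actually connects is not shown here). Setting `Timeout` on `http.Client` bounds the whole exchange, so an unresponsive peer produces an error instead of a connection that hangs. The helper `timesOut` only exists to demonstrate the behavior against a local test server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetch reads url with an overall deadline, so a peer that never answers
// produces an error instead of blocking forever.
func fetch(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// timesOut fetches from a local test server that waits delay before
// answering, and reports whether fetch returned an error.
func timesOut(delay, timeout time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
		case <-r.Context().Done(): // the client gave up
		}
	}))
	defer srv.Close()
	return fetch(srv.URL, timeout) != nil
}

func main() {
	fmt.Println(timesOut(500*time.Millisecond, 50*time.Millisecond)) // true
	fmt.Println(timesOut(0, 2*time.Second))                          // false
}
```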

investigate using git for storing data

Using git might result in fewer seeks for retrieving data, and also less disk usage:

• git add (or maybe git fast-import) (big packages first?)
• git commit (per package or once?)
• git gc at the end

cleanup/unify command line flags and help lines

We have mixed-case command line flags currently and some are even used with slightly different meanings, e.g. -monitorPath.

To resolve this issue, find(1) all executables within the dcs package, run -help on each of them and make sure the output is consistent as a whole.

add /requestz for source-backend

It seems like the source backend is often slow. I would like to introspect its requests at runtime to see what is going on. This needs to keep track of how long a specific query has been running.
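
The bookkeeping behind such a page could be sketched as follows: register each request when it starts, remove it when it finishes, and report the age of everything still running. The `tracker` type and its methods are hypothetical and not part of source-backend.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// inflight describes one request that is still being served.
type inflight struct {
	query   string
	started time.Time
}

// tracker records the requests that are currently running.
type tracker struct {
	mu     sync.Mutex
	nextID int
	active map[int]inflight
}

func newTracker() *tracker {
	return &tracker{active: make(map[int]inflight)}
}

// start registers a request and returns a function to call when it finishes.
func (t *tracker) start(query string) (done func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	id := t.nextID
	t.nextID++
	t.active[id] = inflight{query: query, started: time.Now()}
	return func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.active, id)
	}
}

// report returns one line per running request with how long it has been
// running so far, sorted by query.
func (t *tracker) report() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	lines := make([]string, 0, len(t.active))
	for _, r := range t.active {
		age := time.Since(r.started).Round(time.Millisecond)
		lines = append(lines, fmt.Sprintf("%s: running for %v", r.query, age))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	tr := newTracker()
	done := tr.start("nfc_initiator_transceive_bytes")
	fmt.Println(tr.report())
	done()
	fmt.Println(len(tr.report())) // 0
}
```

A /requestz handler would then print the lines from `report`.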

git describe output should end up in the version number

This also requires us to use tags. I suggest using version numbers as tags (e.g. 1, 2, 3, …). Every time one sets a new tag, that marks a version which should be deployed on codesearch.d.n

In general, this is a requirement for making the Debian package useful and not only a file container.

dcs-index: re-use existing indexing where possible

It looks as if dcs-index re-generates its entire index every time. This is a process which takes several hours; is it necessary? Can you record which packages have changed since the last run and only re-index those source files?
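
The bookkeeping the question asks about could be as simple as comparing the package versions seen in the previous run with the current ones. A sketch with made-up package names and versions follows; whether the index format allows replacing only parts of it is a separate question that this sketch does not address.

```go
package main

import (
	"fmt"
	"sort"
)

// changedPackages returns the sorted names of packages that are new in
// current or whose version differs from the one recorded in previous.
func changedPackages(previous, current map[string]string) []string {
	var changed []string
	for pkg, version := range current {
		if old, ok := previous[pkg]; !ok || old != version {
			changed = append(changed, pkg)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	// Made-up package names and versions, for illustration only.
	previous := map[string]string{"pkg-a": "1.0-1", "pkg-b": "2.3-4"}
	current := map[string]string{"pkg-a": "1.0-1", "pkg-b": "2.3-5", "pkg-c": "0.1-1"}
	fmt.Println(changedPackages(previous, current)) // [pkg-b pkg-c]
}
```

Packages that were removed since the last run are not reported by this function and would need separate handling.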

negative keywords in the search query

Currently, you cannot use e.g. "foo -package:sudo", but it would be useful in some situations.

As an example use-case, psychon mentioned he wanted to search for call sites of cairo-xcb, but exclude the cairo package itself.
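
Parsing such a query could look like the sketch below, which separates `-package:` exclusions from the remaining terms. The `parsedQuery` type and `parseQuery` function are hypothetical and only illustrate the requested syntax.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedQuery is a hypothetical representation of a query with exclusions.
type parsedQuery struct {
	Terms            []string
	ExcludedPackages []string
}

// parseQuery collects "-package:NAME" fields as exclusions and keeps all
// other whitespace-separated fields as search terms.
func parseQuery(q string) parsedQuery {
	var p parsedQuery
	for _, field := range strings.Fields(q) {
		if strings.HasPrefix(field, "-package:") {
			p.ExcludedPackages = append(p.ExcludedPackages, strings.TrimPrefix(field, "-package:"))
			continue
		}
		p.Terms = append(p.Terms, field)
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", parseQuery("foo -package:sudo"))
	// Output: {Terms:[foo] ExcludedPackages:[sudo]}
}
```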

non-JS search page includes HTTP resources

If you have JS disabled, search pages (such as https://codesearch.debian.net/search?q=mooooo) include HTTP resources that make Iceweasel upset:

Iceweasel has blocked parts of this page that are not secure

I guess the problematic code is:

<link rel="stylesheet" href="http://yandex.st/highlightjs/7.0/styles/default.min.css">
<script src="http://yandex.st/highlightjs/7.0/highlight.min.js"></script>

I think it should be safe to remove these lines, as you don't get any syntax highlighting with JS disabled anyway.

make more machine readable

Hi,

correct me if I'm wrong but as far as I can see, the only machine readable interface in codesearch is the json results listing the packages containing a match for the current query. But obtaining this json data requires parsing the HTML of the result page. It would be nice if it was possible to make a query which returned either the link to the json or the json directly.

Make dcs-web automatically discover index-backends according to a given pattern

E.g. -index_backends=localhost:3031x would lead to dcs-web trying 30310 to 30319. Of course, this has a limit of 10 index shards, and we are currently at 6, but it’s a good improvement nevertheless.

The result of this change will be that the service files (and cmd/dcs-batch-helper/batch-helper.go) will not have to be changed when the number of shards changes.
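
Expanding such a pattern into candidate addresses could be sketched as below. The function name `expandBackends` is an assumption, and the sketch covers only the expansion; probing which of the ten addresses actually answer is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandBackends turns a pattern ending in "x", such as "localhost:3031x",
// into the ten addresses obtained by replacing the x with 0 through 9.
// A pattern without a trailing x is returned unchanged.
func expandBackends(pattern string) []string {
	if !strings.HasSuffix(pattern, "x") {
		return []string{pattern}
	}
	prefix := strings.TrimSuffix(pattern, "x")
	addrs := make([]string, 0, 10)
	for digit := 0; digit <= 9; digit++ {
		addrs = append(addrs, prefix+strconv.Itoa(digit))
	}
	return addrs
}

func main() {
	addrs := expandBackends("localhost:3031x")
	fmt.Println(len(addrs), addrs[0], addrs[9])
	// Output: 10 localhost:30310 localhost:30319
}
```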

group search results by source package

seems I found the bug tracker for codesearch :-)

codesearch is already awesome. What would make it even more awesome for archive-wide greps and mass bug filings is the ability to group the search results by source package and only show a summary per source package, which can be expanded via a +
