


linguist's Issues

Shipped Classifier cannot be trained

It's not clear whether this limitation is intentional or if this is a side effect of the YAML loading, but it's not possible to update the Classifier instance with a new language.

I'm trying to teach new languages to the existing classifier at the lowest possible cost, and I'm following this work plan:

  • Add new languages to the classifier and train the classifier with an "adequate" volume of data
  • Reduce the number of tokens for the new languages so that the number of classified tokens remains low to preserve performance (according to the rdoc, #gc is the method I should call, but according to the source it does not do anything. Is this something you plan to implement?)

Do you think this is an acceptable use of your library?

Right now, I'm duck typing Language to feed Classifier#train, and this seems to be enough for it to work. Because the Classifier is not dependent on Language at all, maybe #train could simply take a String as parameter (and #classify could return Strings too). This would greatly simplify interop with your lib :-)

What follows is a simple test case and a patch that allows it to pass.

Cheers,
Pierre.
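
To illustrate the String-based interface suggested above, here is a minimal sketch of a naive Bayes token classifier keyed by plain language names. It is not Linguist's Classifier: the class name `StringClassifier`, the tokenizer, and the add-one smoothing are all assumptions made for the example.

```ruby
# Hypothetical sketch, not Linguist's implementation.
class StringClassifier
  def initialize
    @tokens = Hash.new { |h, k| h[k] = Hash.new(0) } # language => token => count
    @totals = Hash.new(0)                            # language => total tokens
  end

  # Train with a language name (String) and a sample of source text.
  def train(language, data)
    tokenize(data).each do |token|
      @tokens[language][token] += 1
      @totals[language] += 1
    end
    self
  end

  # Returns [[language_name, score], ...], best match first.
  def classify(data)
    tokens = tokenize(data)
    scores = @tokens.keys.map do |language|
      score = tokens.inject(0.0) { |sum, t| sum + Math.log(probability(language, t)) }
      [language, score]
    end
    scores.sort_by { |_, score| -score }
  end

  private

  # Very rough tokenizer: words and single punctuation characters.
  def tokenize(data)
    data.scan(/\w+|[^\s\w]/)
  end

  # Add-one smoothing so unseen tokens never produce log(0).
  def probability(language, token)
    vocabulary = @tokens.values.flat_map(&:keys).uniq.size
    (@tokens[language][token] + 1.0) / (@totals[language] + vocabulary)
  end
end

classifier = StringClassifier.new
classifier.train("Ruby", "def hello; puts 'hi'; end")
classifier.train("Python", "def hello(): print('hi')")
best, _score = classifier.classify("puts 'hi'").first
```

With String keys, callers need no Language objects at all, which is the interop benefit described in the post.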

MaxMSP files still not recognized

Hello,

A few weeks ago (remember #208?) we added MaxMSP samples to the JSON folder, but now the files are detected as JavaScript. A MaxMSP patcher is a graph of objects, dynamically loaded at runtime; it is saved as JSON but has nothing to do with JavaScript.

IMHO the only solution is to add extensions to "languages.yml": ".mxt" is the old format (Max 4); since Max 5 the extensions are ".maxpat" and ".maxhelp".

Binary *.n files are Neko (haXe) applications, not Nemerle code

Linguist is flagging any file with a *.n extension as Nemerle, but the extension is used by Neko binary code.

Since this is compiled code, I don't think it should be counted towards any source code total -- but it should not be flagged as Nemerle!

For example, I have a project which includes haXe source code, that compiles to a Neko application for processing Javascript, building JS projects, etc. 68% of the file total is the compiled *.n application, while the rest is the haXe source code.

do not process files in .linguist-ignore

It would be nice if linguist could read a .linguist-ignore file (or any other name) at the root of the project and skip the files it lists. These files (which can be either auto-generated or imported) are usually not in the same language as the project itself, and may eventually become quite big, making the statistics completely wrong.

If you think this feature is useful, I'm happy to propose a patch.

Git commit

It would be nice to have a highlighter for git commits, so I could paste the output of git show inside a "```commit" block and it would look nice.

(p.s., I know about the diff highlighter, I'm mainly talking about making the message and the metadata before it look nice)

Classifier#to_yaml fails with shipped Classifier

Hi,

I'm trying to train the Classifier and hence to serialize it to disk. I run into an issue while trying to serialize the default Classifier:


irb(main):006:0>  Linguist::Classifier.instance.to_yaml($STDOUT)
ArgumentError: comparison of Array with Array failed
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:172:in `sort'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:172:in `block in to_yaml'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:170:in `each'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:170:in `to_yaml'
        from (irb):6
        from /home/oct/.rbenv/versions/1.9.3-p194/bin/irb:12:in `<main>'
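
For context, a common way to get this exact message in Ruby is sorting arrays whose elements cannot be compared with each other (for example a nil next to a String). Whether that is what happens in classifier.rb is not confirmed here; the data below is made up for illustration.

```ruby
# Arrays are compared element by element. nil and "a" are not comparable,
# so Array#<=> returns nil and sort raises ArgumentError.
pairs = [["b", 2], [nil, 1], ["a", 3]]

begin
  pairs.sort
  sort_failed = false
rescue ArgumentError
  sort_failed = true
end

# Possible workaround: sort on a key that is always comparable.
sorted = pairs.sort_by { |key, _count| key.to_s }
```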

Drop mime-types

Try to get our current mime-type extensions pushed upstream to the mime-types lib. Then try to decouple integration from Linguist. Language detection shouldn't be dependent on any sort of mime type.

add .pl as Prolog extension

At the moment it is recognized as Perl.

edit: Spelling. Both English and Perl are not my native language ;-)

Add .elf extension

I think that all *.elf files should be marked as binary automatically (without reading the file)

Support highlighting Twig templates

The syntax of Twig templates is equivalent to Jinja's (but for PHP projects instead of Python ones), so it could probably be done by reusing the Jinja lexer.
Twig is the default templating engine for Symfony2 (which uses GitHub), so proper highlighting for .twig files would help a lot.

Description of test-suite running

The last part of the README file talks about using some bundle thing, which I guess is some ruby utility. Maybe add some more exact description for the uninitiated masses?

Binary files detected as Perl

Compile this simple assembly program on Linux (using "as exit.s -o exit.o;ld exit.o -o exit;rm exit.o"):
.section .data
.section .text
.globl _start
_start:
movq $111, %rdi
movq $60, %rax
syscall
Then run "bundle exec linguist folder" and you will see this:
88% Perl
12% Assembly
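
The linked executable produced above is an ELF file, and ELF files start with the four bytes 0x7F 'E' 'L' 'F'. A minimal sketch of a check that could flag such files as binary before any language guessing follows; the helper names `elf_header?` and `elf_file?` are hypothetical and not part of Linguist.

```ruby
ELF_MAGIC = [0x7F, 0x45, 0x4C, 0x46] # "\x7F" followed by "ELF"

# True if the given bytes start with the ELF magic number.
def elf_header?(bytes)
  bytes.to_s.unpack("C4") == ELF_MAGIC
end

# Reads only the first four bytes of the file, in binary mode.
def elf_file?(path)
  File.open(path, "rb") { |f| elf_header?(f.read(4)) }
end
```

Because only four bytes are read, the check stays cheap even for large executables.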

Coq / Verilog Misdetections

Linguist is getting Verilog and Coq confused (see Verilog projects
included in https://github.com/languages/Coq and Coq projects included
in https://github.com/languages/Verilog). Both use .v files. I've gone
through the commit history and the first place that I can get it to
fail is at 4484011, however it may be
failing one commit before that at
c114d71. I can't tell for the latter
commit as that fails the Matlab / obj-c case first. Everything passes
if you go one commit earlier.

I'm using some of my Verilog files to test it, specifically, the files
sitting in https://github.com/seldridge/verilog, and linguist just
isn't having it. Linguist continues to pass for the one test file
(sha-256-functions.v) currently in use. I'm no Ruby guy, so I haven't
attempted to look into this in any significant depth beyond the regex
in blob_helper.rb. This doesn't seem to be the issue as it's picking
up the important matches in my testcases, namely comment structure and
the "module" keyword.

Erlang escript bundle is treated as JavaScript

An escript bundle is a compressed Erlang script. Linguist incorrectly detects it as JavaScript:

$ file ./rebar
./rebar: a escript script text executable
$ linguist ./rebar
./rebar: 0 lines (0 sloc)
  type:      Binary
  mime type: text/plain
  language:  JavaScript
$

...so many Erlang projects that ship with the rebar build tool script may be detected as JavaScript projects although they are pure Erlang!

Deep content inspection tweaking

I found the place where #! files are analyzed for the right language, but I don't see anywhere a way to extend it. In our case, the simplest way to identify a Racket file would be to look for a #lang line (see example here). A less precise but possibly more broadly useful heuristic is to look for an exec foo line near the top of the file.

Either way, it's not clear whether this is intended to be customizable, and if so, how to do it.
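
The two heuristics described above could be sketched as follows. This is an illustration only: the method names, the five-line window, and the regular expressions are assumptions, and nothing here reflects an existing Linguist extension point.

```ruby
# True if one of the first few lines is a "#lang ..." line.
def racket_lang_line?(data, max_lines = 5)
  data.each_line.first(max_lines).any? { |line| line =~ /\A#lang\s+\S+/ }
end

# Returns the program named on an "exec foo ..." line near the top
# of the file, or nil if there is none.
def exec_interpreter(data, max_lines = 5)
  data.each_line.first(max_lines).each do |line|
    match = line.match(/\A\s*exec\s+(\S+)/)
    return File.basename(match[1]) if match
  end
  nil
end
```

For example, `racket_lang_line?("#lang racket\n(define x 1)\n")` is true, while a shell script that only contains `exec racket ...` is caught by the second, less precise check.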

Objective-C wrong recognition

I can't understand why linguist detects my project's main language as Objective-C. It's written entirely in C++ (Qt). I don't know the Ruby language, so I can't find the problem. Can anyone help me?

P.S. My project does not have any *.mm or *.m files. It has only *.h, *.cpp, *.ui, *.qrc, *.css, *.png files.
P.P.S. The problem is in the GitHub "language color bar" (at the top right of the repo page). The main language is OK.

Allow specifying an ignore file for language statistics

Some repositories (like SignalR) have samples that include common JavaScript libraries like jQuery, and GitHub ends up classifying the project as JavaScript instead of C# (in this particular case). Nothing is wrong with this at a high level, since jQuery is JavaScript, but project maintainers who want more control over statistics need a way to opt out of this behavior.

I see 2 options:

  • Short term hack: Exclude commonly used JS files. This will handle some scenarios, but you'll have to exclude multiple versions of the library (unless you had wildcard support).
  • Longer term solution: Allow a repository to have a .lignore or equivalent (I suck at naming) that uses glob syntax to exclude files from being processed for language statistics.
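
A rough sketch of the longer term option, using Ruby's built-in `File.fnmatch` for glob matching. The ignore-file format (one glob per line, `#` for comments), the helper names, and the sample paths are all hypothetical.

```ruby
# Parse the text of an ignore file: one glob per line,
# blank lines and lines starting with "#" are skipped.
def parse_ignore_patterns(text)
  text.each_line.map(&:strip).reject { |line| line.empty? || line.start_with?("#") }
end

# True if the path matches any of the glob patterns.
# FNM_PATHNAME keeps "*" from crossing "/" and enables "**/".
def ignored?(path, patterns)
  flags = File::FNM_PATHNAME | File::FNM_DOTMATCH
  patterns.any? { |pattern| File.fnmatch(pattern, path, flags) }
end

patterns = parse_ignore_patterns("# vendored libraries\nsamples/**/jquery*.js\n*.min.js\n")
files    = ["src/Hub.cs", "samples/chat/Scripts/jquery-1.6.4.js", "app.min.js"]
counted  = files.reject { |file| ignored?(file, patterns) }
```

Wildcards also cover the "multiple versions of the library" problem mentioned for the short term hack, since one pattern matches every jQuery release.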

README

Write up a more complete README.

Scores sent back by the lib are curious

Hello,

The documentation states that it should return floats. On my installation, it returns negative numbers:

[[#<Linguist::Language name=PHP>, -66.98989614319586],
 [#<Linguist::Language name=JavaScript>, -68.77510897386178],
 [#<Linguist::Language name=Ruby>, -70.7837674453772],
 [#<Linguist::Language name=Perl>, -71.16156437444059],
 [#<Linguist::Language name=Gosu>, -72.90117504252562],
 [#<Linguist::Language name=Python>, -73.0532406574862],
 [#<Linguist::Language name=Objective-C>, -74.10993364147689],
 [#<Linguist::Language name=TeX>, -77.81775680913668],
 [#<Linguist::Language name=Java>, -78.66295010514327],
 [#<Linguist::Language name=Kotlin>, -79.19112391377584],
 [#<Linguist::Language name=Scala>, -79.596874273976],
 [#<Linguist::Language name=C++>, -80.16597822216151],
 [#<Linguist::Language name=CoffeeScript>, -83.44077180874064],
 [#<Linguist::Language name=Apex>, -83.80881093343098],
 [#<Linguist::Language name=C>, -85.47097078986161],
 [#<Linguist::Language name=AppleScript>, -85.68956917025051],
 [#<Linguist::Language name=SCSS>, -86.60214237229394],
 [#<Linguist::Language name=Groovy>, -86.89541966825266],
 [#<Linguist::Language name=Shell>, -87.43588353355483],
 [#<Linguist::Language name=Dart>, -87.459050333217],
 [#<Linguist::Language name=Coq>, -88.6740351917743],
 [#<Linguist::Language name=Rust>, -93.09294395196528],
 [#<Linguist::Language name=Nemerle>, -93.21419319559817],
 [#<Linguist::Language name=PowerShell>, -93.51902834727619],
 [#<Linguist::Language name=Arduino>, -93.5392310545937],
 [#<Linguist::Language name=Opa>, -93.78609113252523],
 [#<Linguist::Language name=XQuery>, -93.83645881136175],
 [#<Linguist::Language name=R>, -94.21217552783614],
 [#<Linguist::Language name=Delphi>, -94.35016127081002],
 [#<Linguist::Language name=SuperCollider>, -94.40855958019455],
 [#<Linguist::Language name=Verilog>, -94.8229388269385],
 [#<Linguist::Language name=OpenCL>, -96.50244013644215],
 [#<Linguist::Language name=Groovy Server Pages>, -96.56948552051941],
 [#<Linguist::Language name=Racket>, -97.8652823987905],
 [#<Linguist::Language name=OCaml>, -99.6352432871025],
 [#<Linguist::Language name=Matlab>, -101.76930665936734],
 [#<Linguist::Language name=XML>, -101.8170795450655],
 [#<Linguist::Language name=Haml>, -102.25666430330622],
 [#<Linguist::Language name=Scilab>, -102.64814316943966],
 [#<Linguist::Language name=INI>, -102.66212941141441],
 [#<Linguist::Language name=Logtalk>, -103.5329577692118],
 [#<Linguist::Language name=GAS>, -103.96895960118005],
 [#<Linguist::Language name=Sass>, -104.20257445236155],
 [#<Linguist::Language name=Turing>, -104.82161366076778],
 [#<Linguist::Language name=OpenEdge ABL>, -105.1428606897919],
 [#<Linguist::Language name=VimL>, -112.11353183520714],
 [#<Linguist::Language name=Standard ML>, -112.11353183520714],
 [#<Linguist::Language name=Nu>, -112.80667901576709],
 [#<Linguist::Language name=Parrot Assembly>, -112.80667901576709],
 [#<Linguist::Language name=Scheme>, -112.80667901576709],
 [#<Linguist::Language name=Julia>, -112.80667901576709],
 [#<Linguist::Language name=Ioke>, -112.80667901576709],
 [#<Linguist::Language name=Rebol>, -112.80667901576709],
 [#<Linguist::Language name=Parrot Internal Representation>,  -112.80667901576709],
 [#<Linguist::Language name=Emacs Lisp>, -112.80667901576709],
 [#<Linguist::Language name=Tea>, -112.80667901576709],
 [#<Linguist::Language name=Nimrod>, -112.80667901576709],
 [#<Linguist::Language name=VHDL>, -112.80667901576709],
 [#<Linguist::Language name=Diff>, -112.80667901576709],
 [#<Linguist::Language name=Markdown>, -112.80667901576709],
 [#<Linguist::Language name=Visual Basic>, -112.80667901576709],
 [#<Linguist::Language name=Prolog>, -112.80667901576709],
 [#<Linguist::Language name=AutoHotkey>, -112.80667901576709],
 [#<Linguist::Language name=XSLT>, -112.80667901576709],
 [#<Linguist::Language name=YAML>, -112.80667901576709]]

Still the results are in the correct order...
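
One plausible explanation, assuming the classifier sums the logarithms of per-token probabilities (a common naive Bayes technique to avoid floating point underflow): the logarithm of any probability below 1 is negative, so the totals are negative Floats and the ordering is still meaningful. The probabilities below are made up for illustration.

```ruby
# Hypothetical per-token probabilities for two languages.
token_probabilities = {
  "PHP"  => [0.20, 0.10, 0.05],
  "Ruby" => [0.10, 0.05, 0.02]
}

# Sum of logs instead of product of probabilities.
scores = token_probabilities.map do |language, probs|
  [language, probs.inject(0.0) { |sum, prob| sum + Math.log(prob) }]
end

# The highest (least negative) score is the best match.
ranked = scores.sort_by { |_, score| -score }
```

Under that assumption, identical scores for many languages at the bottom of the list would simply mean none of the input tokens were seen for those languages.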

ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]

The same behavior on x86_64 linux.

Nimrod .nim files no longer are recognized after some change to linguist

ugh, anyone with Ruby experience want to figure out why github's linguist does not consider .nim files to be the Nimrod language anymore?

I'm quite sure it fails on my comp because I have the latest Ruby version and it doesn't support it.

I don't know what I need, all I want is to get linguist to run.

I also noticed that linguist fails with an error:

custom_require.rb:36:in `require': cannot load such file -- pygments (LoadError)

But I can't find any "gem install pygments"

Do people really HAVE to use bundler in order to try linguist? I don't like bundler
at all, it messes up things in ways I don't want to. :(

All we have to find out is why linguist no longer recognizes .nim files

    Nimrod:
      type: programming
      color: "#37775b"
      primary_extension: .nim
      extensions:
      - .nimrod

It should work but it does not.

(.nim are default extensions for nimrod files)

Ship public gem

There's already a linguist gem, so we'll take github-linguist.

Ruby 1.9.2: file content encoding causes file blobs to fail

Creating file blobs can fail because the file contents might be encoded. This issue should only be present in Ruby 1.9+, as Ruby 1.8 did not care about file encodings.

A temporary solution is to do this in file_blob.rb:

    # Public: Read file contents.
    #
    # Returns a String.
    def data
      File.read(@path).encoding.to_s
    end

Only thing is the test cases fail now.

Note: If this project was only intended to work with Ruby 1.8, then disregard this.
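
Note that `File.read(@path).encoding.to_s` returns the name of the encoding (for example "UTF-8") rather than the file contents, which may be why the tests fail. An alternative sketch, assuming the goal is only to read the bytes without tripping over encodings, is to read in binary mode; the helper names are hypothetical.

```ruby
require "tempfile"

# Read the raw bytes; the result is tagged ASCII-8BIT and never transcoded.
def read_raw(path)
  File.binread(path)
end

# Check whether the bytes would be valid UTF-8 without modifying them.
def utf8_text?(data)
  data.dup.force_encoding("UTF-8").valid_encoding?
end

# Demo with a file containing one byte that is invalid in UTF-8.
file = Tempfile.new("blob")
file.binmode
file.write("caf\xC3\xA9 \xFF".force_encoding("BINARY"))
file.close

data = read_raw(file.path)
```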

Matlab extension .m

I see that you treat .matlab as Matlab's extension; however, .m is the popular choice (and one of the standard extensions).

I know this conflicts with Objective-C's m files, but it would be interesting to have an option to make syntax checks to guess the extension in dubious cases.

This is confusing to me, as I have both Objective-C and Matlab repositories.
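
A syntax check of the kind suggested above might look like this minimal sketch. The hint patterns and the method name `guess_m_language` are assumptions chosen for illustration; a real heuristic would need many more cases.

```ruby
# Lines that strongly suggest Objective-C.
OBJC_HINTS   = /^[ \t]*(?:@interface|@implementation|@end|#import)\b/
# Lines that suggest Matlab: "%" comments or "function out = name(...)".
MATLAB_HINTS = /^[ \t]*(?:%|function\b[^\n]*=)/

# Returns "Objective-C", "Matlab", or nil when the evidence is tied.
def guess_m_language(data)
  objc   = data.scan(OBJC_HINTS).size
  matlab = data.scan(MATLAB_HINTS).size
  return nil if objc == matlab
  objc > matlab ? "Objective-C" : "Matlab"
end
```

Returning nil on a tie leaves room for a fallback, such as a statistical classifier, instead of forcing a guess.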

Invalid gemspec (missing authors)

I'm receiving the following error when I try to install linguist via bundle:

linguist at /usr/lib64/ruby/gems/1.9.1/bundler/gems/linguist-d8903afc12b1 did not have a valid gemspec.
This prevents bundler from installing bins or native extensions, but that may not affect its functionality.
The validation message from Rubygems was:
authors may not be empty

If I clone linguist locally and add an authors line to the .gemspec file, it works fine.

I'm on ruby 1.9.1

`startinline` option for PHP highlighting

I didn't see a way to pass options to each lexer from languages.yml, but it would be great to have the startinline option in Pygments turned on for PHP. See "Lexers for web-related languages and markup" under PhpLexer:

startinline
If given and True the lexer starts highlighting with php code (i.e.: no starting <?php required).
The default is False.

Ideally, this sample snippet of PHP code from the Symfony2 project would be highlighted with ```php without having to include <?php:

/**
 * Client simulates a browser and makes requests to a Kernel object.
 *
 * @author Fabien Potencier <[email protected]>
 *
 * @api
 */
class Client extends BaseClient
{
    protected $kernel;

    /**
     * Constructor.
     *
     * @param HttpKernelInterface $kernel    An HttpKernel instance
     * @param array               $server    The server parameters (equivalent of $_SERVER)
     * @param History             $history   A History instance to store the browser history
     * @param CookieJar           $cookieJar A CookieJar instance to store the cookies
     */
    public function __construct(HttpKernelInterface $kernel, array $server = array(), History $history = null, CookieJar $cookieJar = null)
    {
        $this->kernel = $kernel;

        parent::__construct($server, $history, $cookieJar);

        $this->followRedirects = false;
    }
}

C code detected as Objective C

Hello,

I have a repository in Github, the Refu Library, which is a pure C project. For some reason the majority of the source files are identified as Objective C and so the project itself is tagged as Objective C. Here is the repository:
http://github.com/LefterisJP/Refu/

I have no knowledge of Ruby so I can't understand how the Linguist project works to find the problem. Any assistance with this matter will be appreciated.

Doesn't Detect Languages

My Github repo is not getting any graph data. This is built on approx 98% PHP and a little bit of Javascript. Not sure why I am not getting stats anymore (I used to).

-Chris

Prolog files misclassified as Perl files

Prolog files are once again misclassified as Perl files. The disambiguation code seems to have been removed. The current spec for Prolog defines "primary_extension" as ".prolog", which nobody in the Prolog programming community uses or has ever used. The default extension for Prolog is ".pl" (from long before Perl ever existed). How can we get the disambiguation functionality back?

foundation detected as PHP

foundation is detected as ~75% PHP.

But the PHP files in foundation contain a lot of HTML and only one to three PHP instructions.

It should be detected as ~70% HTML and ~5% PHP.

Pull Request Failure

Travisbot failed this request: #216

To be honest, I'm fairly new to GitHub, and while it looked like contributing to linguist would prove straightforward, something has clearly gone awry. Any idea what?
