harper's Introduction

Harper

Harper is an English grammar checker designed to be just right. I created it after years of dealing with the shortcomings of the competition.

Grammarly was too expensive and too overbearing. Its suggestions lacked context, and were often just plain wrong. Not to mention: it's a privacy nightmare. Everything you write with Grammarly is sent to their servers. Their privacy policy claims they don't sell the data, but that doesn't mean they don't use it to train large language models and god knows what else. Not only that, but the round-trip-time of the network request makes revising your work all the more tedious.

LanguageTool is great, if you have gigabytes of RAM to spare and are willing to download the ~16GB n-gram dataset. Besides the memory requirements, I found LanguageTool too slow: it would take several seconds to lint even a moderate-size document.

That's why I created Harper: it is the grammar checker that fits my needs. Not only does it take milliseconds to lint a document and use less than 1/50th of LanguageTool's memory footprint, but it is also completely private.

Harper is even small enough to load via WebAssembly.

Installation

If you want to use Harper on your machine, you will want to look at the documentation for harper-ls, the Language Server Protocol implementation.

Performance Issues

We consider long lint times bugs. If you encounter any significant performance issues, please create an issue on the topic.

If you find a fix to any performance issue, we are open to the contribution. Just make sure to read our contribution guidelines first.

harper's Issues

feat: ignore shebangs

Harper should not lint shebangs found at the start of files.

Examples include:

#! /usr/bin/ruby
#! /bin/bash
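
One way to approach this, sketched under assumptions: compute the byte offset where linting should start rather than cutting the line out, so that any spans reported later still point into the original file. The helper name `lint_start_offset` is hypothetical and is not part of Harper's API.

```rust
/// Returns the byte offset at which linting should begin.
/// `lint_start_offset` is a hypothetical helper, not part of Harper's API.
fn lint_start_offset(source: &str) -> usize {
    if !source.starts_with("#!") {
        return 0;
    }
    // Skip up to and including the first newline. A file that is only a
    // shebang has nothing left to lint.
    match source.find('\n') {
        Some(newline) => newline + 1,
        None => source.len(),
    }
}
```

A caller would lint `&source[lint_start_offset(source)..]` and add the offset back onto any reported span.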

bug: Sentence parser does not include final quotation mark

The current sentence implementation does not consider that quotation marks can appear after the sentence terminator. We should consider this final quotation as part of the sentence.

For example:

She said, "There is no way this is true."
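
A minimal sketch of the intended behavior, assuming sentences end at `.`, `!` or `?` and that only straight and curly closing quotes need to be absorbed. `first_sentence_end` is an illustrative name and does not reflect Harper's actual sentence parser.

```rust
/// Returns the byte index just past the first sentence in `text`,
/// including any quotation marks that directly follow the terminator.
/// Hypothetical helper for illustration only.
fn first_sentence_end(text: &str) -> usize {
    let mut chars = text.char_indices().peekable();
    while let Some((idx, c)) = chars.next() {
        if matches!(c, '.' | '!' | '?') {
            let mut end = idx + c.len_utf8();
            // Absorb closing quotation marks that come right after the terminator.
            while let Some(&(quote_idx, quote)) = chars.peek() {
                if matches!(quote, '"' | '\'' | '”' | '’') {
                    end = quote_idx + quote.len_utf8();
                    chars.next();
                } else {
                    break;
                }
            }
            return end;
        }
    }
    // No terminator found: treat the whole text as one sentence.
    text.len()
}
```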

feat: Capitalize Common Proper Nouns

Harper should detect and repair uncapitalized proper nouns and brand names.

Examples:

youtube -> YouTube
youTube -> YouTube
Youtube -> YouTube
china -> China
united States -> United States
United states -> United States
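
As a rough sketch, this could be a case-insensitive lookup against a list of canonical spellings. The `PROPER_NOUNS` list and the function name below are assumptions made for illustration; a real rule would need a much larger list and would have to match multi-word names inside running text.

```rust
/// Canonical spellings to enforce. A tiny, hypothetical sample list.
const PROPER_NOUNS: &[&str] = &["YouTube", "China", "United States"];

/// Returns the canonical spelling if `phrase` matches a known proper noun
/// in every way except capitalization, and `None` otherwise.
fn suggest_capitalization(phrase: &str) -> Option<&'static str> {
    PROPER_NOUNS
        .iter()
        .copied()
        .find(|canonical| canonical.eq_ignore_ascii_case(phrase) && *canonical != phrase)
}
```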

bug: + not recognized as a valid word

When using the + symbol for math equations inside a markdown file I get the following fixes given the following input:

T(n) = T(n-1) + 1

Fixes:

1. Replace with: "0"
2. Replace with: "CD"
3. Replace with: "0" 

feat: Should Parse Markdown Correctly

Currently, Harper only parses plain English properly.

This results in errors like the following, where the word "adore" is flagged even though it is spelled correctly.

I __adore__ this cupcake.

In order to parse markdown properly, we need to generate tokens in the same way we generate tokens for plain English, while ignoring the markdown additions.

I believe we can do this through the use of pulldown_cmark.

License

Hi! This repo doesn't have a license, which means that it's technically not FOSS. If your intention is for it to be FOSS, consider adding a license :)

FR: custom dictionary path

Right now, the location of the custom dictionary path is static. It would be useful to name a custom location, e.g. for syncing purposes.

bug: parsing error on words containing non-English characters

If I try to write a name of something containing non-English characters, harper-ls tries to spellcheck the substring after the special character. I don't expect it to recognise names from other languages since it's meant for English, but it hinders me from adding the whole word to the global/local dictionary using my editor's quick fix.

Example:
I try to write the word Løvetann. harper-ls asks if I meant to spell vetann this way, and the quick fix option is to add vetann to the dictionary.

Requested behaviour:
I would like harper-ls to mark the whole word, and give me the option to add Løvetann to the global/local dictionary.
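
To illustrate the requested behaviour, here is a small sketch that splits on Unicode-aware letter boundaries instead of ASCII-only ones. It assumes nothing about how harper-ls tokenizes today; `words` is a hypothetical helper.

```rust
/// Splits `text` into words, treating any Unicode letter (and the
/// apostrophe) as part of a word. Hypothetical helper for illustration.
fn words(text: &str) -> Vec<&str> {
    text.split(|c: char| !(c.is_alphabetic() || c == '\''))
        .filter(|word| !word.is_empty())
        .collect()
}
```

Because `char::is_alphabetic` accepts `ø`, the whole of `Løvetann` stays in one token and can be offered to the dictionary as a unit.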

"thread 'main' panicked at [...] when slicing

Hi! Thanks for this awesome project!

I got this panic in a Lua file when using Harper 0.6.2 as a language server:

[ERROR][2024-02-16 20:27:27] .../vim/lsp/rpc.lua:796	"rpc"	"/home/melker/.cargo/bin/harper-ls"	"stderr"	"thread 'main' panicked at /home/melker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/harper-ls-0.6.2/src/tree_sitter_parser.rs:185:32:\nbegin <= end (23 <= 14) when slicing `-----------\n-- Noice --\n-----------\n`\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n"

Here's what the file looks like (stripped down, but still causing the error):

-----------
-- Noice --
-----------
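
The root cause inside `tree_sitter_parser.rs` is not shown here, but the panic itself comes from indexing a string with a range whose start is past its end. As a hedged illustration, `str::get` returns `None` for such a range instead of panicking. The wrapper name `safe_slice` is hypothetical.

```rust
/// Returns the requested slice, or `None` if the range is inverted,
/// out of bounds, or not on a character boundary.
/// Hypothetical wrapper around the standard `str::get`.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

With the file above, `safe_slice(text, 23, 14)` yields `None` where `&text[23..14]` would panic, so the caller can skip the comment rather than crash the server.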

bug: improve cross-case spellcheck.

As of right now, these words are marked incorrect (which is valid), but no suggestions are provided due to improper handling of capitalization.

ymca
Ymca

feat: Detect repetition of common words

Harper should include a linting rule that detects and repairs incorrect repetition of common words.

Examples that should throw a lint:

She lifted the the rock.
I will will do it later.

Examples that should not throw a lint:

This is very very difficult.

While the above example could be grammatically improved by removing the repetition, it is not a grammatical error. This improvement should be a separate lint.
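
A minimal sketch of such a rule, assuming a fixed list of words that are never legitimately doubled. The `NEVER_REPEATED` list and the function name are made up for this example and are not Harper's implementation.

```rust
/// Words whose immediate repetition is treated as an error.
/// A hypothetical sample list; "very" is deliberately absent.
const NEVER_REPEATED: &[&str] = &["the", "a", "an", "will", "is", "of", "to"];

/// Returns the position (counting whitespace-separated words from zero)
/// of every word that wrongly repeats the word before it.
fn repeated_word_indices(sentence: &str) -> Vec<usize> {
    // Normalize each word: strip surrounding punctuation and lowercase it.
    let words: Vec<String> = sentence
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .collect();

    let mut hits = Vec::new();
    for (i, pair) in words.windows(2).enumerate() {
        if pair[0] == pair[1] && NEVER_REPEATED.contains(&pair[1].as_str()) {
            hits.push(i + 1);
        }
    }
    hits
}
```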

bug: Contractions are marked as a single token

Harper currently marks contractions as a single token, rather than three.

For example: you'll should be marked as you ' ll. Similarly, ain't should be marked as ain ' t.
Where ' is a punctuation token.
This is with the intention of running a special spellchecking linter for contractions.
They should not be handled by the generalized spellchecking linter.

This is related to #6.
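
The three-token split could look roughly like this. The `Token` enum and `split_contraction` function are illustrative stand-ins and do not mirror Harper's real token types.

```rust
/// Simplified token type for this sketch only.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Word(&'a str),
    Apostrophe,
}

/// Splits a contraction into word, apostrophe, word.
/// Anything without an inner apostrophe is returned as a single word.
fn split_contraction(word: &str) -> Vec<Token<'_>> {
    match word.split_once('\'') {
        Some((head, tail)) if !head.is_empty() && !tail.is_empty() => {
            vec![Token::Word(head), Token::Apostrophe, Token::Word(tail)]
        }
        _ => vec![Token::Word(word)],
    }
}
```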

add to mason registry

Seeing that nvim-lspconfig already has a config for harper, it should also be added to mason as a common method for installing packages in nvim.

Build from crates.io fails

Looks to be a semver problem in the deps?
(doesn't occur when building from git)

   Compiling harper-ls v0.8.1
error[E0308]: `match` arms have incompatible types
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/harper-ls-0.8.1/src/tree_sitter_parser.rs:30:24
   |
18 |           let language = match file_extension {
   |  ________________________-
19 | |             "rs" => tree_sitter_rust::language(),
20 | |             "tsx" => tree_sitter_typescript::language_tsx(),
21 | |             "ts" => tree_sitter_typescript::language_typescript(),
...  |
29 | |             "rb" => tree_sitter_ruby::language(),
   | |                     ---------------------------- this and all prior arms are found to be of type `tree_sitter::Language`
30 | |             "swift" => tree_sitter_swift::language(),
   | |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `tree_sitter::Language`, found a different `tree_sitter::Language`
...  |
34 | |             _ => return None
35 | |         };
   | |_________- `match` arms have incompatible types
   |
   = note: `tree_sitter::Language` and `tree_sitter::Language` have similar names, but are actually distinct types
note: `tree_sitter::Language` is defined in crate `tree_sitter`
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tree-sitter-0.22.1/binding_rust/lib.rs:55:1
   |
55 | pub struct Language(*const ffi::TSLanguage);
   | ^^^^^^^^^^^^^^^^^^^
note: `tree_sitter::Language` is defined in crate `tree_sitter`
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tree-sitter-0.20.10/binding_rust/lib.rs:43:1
   |
43 | pub struct Language(*const ffi::TSLanguage);
   | ^^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `tree_sitter` are being used?

feat: "a" vs "an"

As the title says. Harper should be able to check and provide suggestions to fix improper use of "a" vs "an" depending on the succeeding word.
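
A deliberately naive sketch of the check, assuming the choice depends only on whether the next word starts with a vowel letter. That assumption is wrong for words such as "hour" or "university", where the sound rather than the letter decides, so a real rule would need an exception list or pronunciation data. The function name is hypothetical.

```rust
/// Returns the article that should replace `article`, or `None` if the
/// existing one already fits. Uses a letter-based heuristic only.
fn article_suggestion(article: &str, next_word: &str) -> Option<&'static str> {
    let is_article = article.eq_ignore_ascii_case("a") || article.eq_ignore_ascii_case("an");
    let first = next_word.chars().next()?;
    if !is_article || !first.is_ascii_alphabetic() {
        return None;
    }
    let expected = if matches!(first.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u') {
        "an"
    } else {
        "a"
    };
    if article.eq_ignore_ascii_case(expected) {
        None
    } else {
        Some(expected)
    }
}
```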

feat: Spell checker should expand search if no words are found

When the spell checker encounters especially long, incorrectly spelled words, it fails to provide any suggestions.

For example: algorithmically gives no suggestions. Algorithmically is now in the dictionary, but when it wasn't it was marked as incorrectly spelled.

This can be fixed by gradually expanding beyond the max_edit_dist until a word is found.
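
The widening search might be sketched as follows. This uses a plain Levenshtein distance and a linear scan of the dictionary, which is an assumption made to keep the example short. The `hard_limit` parameter is also hypothetical: it caps the expansion so the search cannot grow without bound.

```rust
/// Classic Levenshtein distance, computed row by row.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Starts at `max_edit_dist` and widens the search one step at a time
/// until at least one candidate is found or `hard_limit` is exceeded.
fn suggest<'a>(
    word: &str,
    dictionary: &[&'a str],
    max_edit_dist: usize,
    hard_limit: usize,
) -> Vec<&'a str> {
    for dist in max_edit_dist..=hard_limit {
        let found: Vec<&str> = dictionary
            .iter()
            .copied()
            .filter(|candidate| edit_distance(word, candidate) <= dist)
            .collect();
        if !found.is_empty() {
            return found;
        }
    }
    Vec::new()
}
```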

feat: Anaphora checking

Not sure what we can do here. There are some situations where repetition of the same word at the beginning of a sentence is a literary device (anaphora), and others where it comes off as wordy and wrong.

How can we notify the user when they do the latter?

bug: Spell checker runs on number suffixes

The spell checker should not need to run on number suffixes.

For example:

Ideally, all of them will be completed before August 12th.

Currently, Harper flags the -th as an error.
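
One possible guard, sketched with a hypothetical helper: recognize a token made of digits followed by an ordinal suffix and skip spellchecking it. This version does not verify that the suffix agrees with the number, so "1th" would also pass.

```rust
/// Returns true for tokens such as "12th" or "1st": one or more ASCII
/// digits followed by an ordinal suffix. Hypothetical helper.
fn is_ordinal_number(token: &str) -> bool {
    let digits_end = token
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(token.len());
    let (digits, suffix) = token.split_at(digits_end);
    !digits.is_empty()
        && matches!(
            suffix.to_ascii_lowercase().as_str(),
            "st" | "nd" | "rd" | "th"
        )
}
```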

feat: detect missing spaces

If the document contains a large, misspelled "word," like nospaces, we should make a best guess at where we could insert a space to make it two valid words.
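
A brute-force sketch of that guess: try every split point and keep the first one where both halves are dictionary words. The slice-based dictionary and the function name are assumptions for illustration, and the sketch returns the first match rather than ranking alternatives.

```rust
/// Tries every split point in `word` and returns the first pair of
/// halves that are both in `dictionary`. Hypothetical helper.
fn split_into_two_words<'a>(word: &'a str, dictionary: &[&str]) -> Option<(&'a str, &'a str)> {
    word.char_indices()
        .skip(1) // never produce an empty left half
        .map(|(idx, _)| word.split_at(idx))
        .find(|(left, right)| dictionary.contains(left) && dictionary.contains(right))
}
```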

bug: checking go directives

When using Harper on Go codebases it would be nice to ignore checking Go comments that start with //go:* as these comments are directives for the compiler. The following is an example:

//go:embed templates/login.html
var LoginPageHTML string

The documentation on this Go feature is here
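
A sketch of the filter, assuming the parser hands over each comment as a line of text that still includes its `//` prefix. `is_go_directive` is a hypothetical name. Go directives have no space between `//` and `go:`, which is what separates them from ordinary comments.

```rust
/// Returns true if the comment is a Go compiler directive such as
/// `//go:embed` and should therefore be skipped by the linter.
fn is_go_directive(comment_line: &str) -> bool {
    comment_line.trim_start().starts_with("//go:")
}
```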

bug: Harper lints inline math

Right now Harper throws a hissy fit every time inline math is used. Inline math should be ignored by Harper.

I've been meaning to fix this issue for quite some time now. When inspecting Markdown, Harper currently uses pulldown_cmark, which doesn't presently support math.

However, they recently merged support into the branch for the upcoming release. Before we can include the changes, we need that version of pulldown_cmark to be on crates.io.

The merged request

feat: Implement parsing of Hunspell dictionaries

Overview

There are several open source dictionaries available in the hunspell *.dict and *.aff formats. Notably, there are a good many here.

Why?

Right now, the main problem with the spellchecker is the available word list.
The current one, english_words.txt, has too many words.
Not only that, but the word list also contains a lot of "words" that don't seem to be part of the standard English lexicon (e.g. "aarp").

By enabling Harper to use Hunspell dictionaries, we can lean on the existing curation.

The Formats

Source

*.dict File

The *.dict file is extremely similar in usage to our existing english_words.txt file.
The main difference is the addition of the / separated postfixes that provide additional information about each word.
These postfixes allow Hunspell to ship a relatively small word set, and expand it at runtime.

This file could technically act as a drop-in replacement for the existing wordlist, but certain words will be marked as invalid, since we wouldn't be processing the postfixes.
For example, "there" would be marked as valid, but "there's" would not.
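
As a starting point, reading one entry could look like the sketch below, which only separates the word from its postfix flags and does no expansion. The function name and the flag `M` used in the usage note are illustrative. A real loader would also have to skip the first line of the file, which in Hunspell dictionaries holds an approximate word count.

```rust
/// Splits one dictionary line into `(word, flags)`.
/// Lines without a `/` have an empty flag string. Hypothetical helper.
fn parse_dict_line(line: &str) -> Option<(&str, &str)> {
    let line = line.trim();
    if line.is_empty() {
        return None;
    }
    Some(line.split_once('/').unwrap_or((line, "")))
}
```

For a line such as `there/M`, this returns the word `there` and the flag string `M`. What `M` expands to is defined by the affix file.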

*.aff File

The affix file defines how the postfixes described above should be expanded.
Right now, we do not intend to support the entire *.aff file format, just enough to fit our needs with a specific dictionary. For example, we will ignore the encoding setting and assume all dictionaries are UTF-8.
We will also (at least initially) not support compounding.

bug: table cells are not treated as end of sentence

Maybe you can classify it as not a bug, but my understanding of a cell in markdown is that the sentence is implicitly ended whether or not there is a period.

Example: harper-ls will complain that the bottom half of this table is a sentence which is too long

| Key        | Action                                                                 |
| ---------- | ---------------------------------------------------------------------- |
| `j`        | Scroll down                                                            |
| `k`        | Scroll up                                                              |
| `l`        | Scroll one page down                                                   |
| `h`        | Scroll one page up                                                     |
| `r`        | Reload file                                                            |
| `f` or `/` | Search                                                                 |
| `n` or `N` | Jump to next or previous search result                                 |
| `s` or `S` | Enter select link mode. Different selection strategy.                  |
| `Enter`    | Select. Depending on which mode it can: open file, select link, search |
| `Esc`      | Go back to _normal_ mode                                               |
| `t`        | Go back to files                                                       |
| `b`        | Go back to previous file (file tree if no previous file)               |
| `g`        | Go to top of file                                                      |
| `G`        | Go to bottom of the file                                               |
| `d`        | Go down half a page                                                    |
| `u`        | Go up half a page                                                      |
| `q`        | Quit the application                                                   |
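
One way to get that behaviour, sketched here without any Markdown library: split each table row into cells and lint every cell as its own sentence. `table_cells` is a hypothetical helper, and it does not handle escaped pipes inside a cell.

```rust
/// Splits a Markdown table row into its trimmed cell contents.
/// Hypothetical helper; escaped `\|` inside cells is not supported.
fn table_cells(row: &str) -> Vec<&str> {
    row.trim()
        .trim_matches('|')
        .split('|')
        .map(str::trim)
        .collect()
}
```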

feat: Include spaces after commas

Include a checker that ensures there is exactly one space after every comma. Quotes complicate things, since spaces should come after the quote.

Test cases:

hello world,my friend
"Hello,"my friend said.

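A rough sketch of such a checker, assuming only straight double quotes and plain spaces matter. It reports the byte offset of each offending comma. The function name is hypothetical, and the sketch would also flag digit groupings such as 1,000, which a real rule would need to exempt.

```rust
/// Returns the byte offset of every comma that is not followed by exactly
/// one space. A closing `"` directly after the comma is skipped first, so
/// the space is expected after the quote. Hypothetical helper.
fn comma_spacing_errors(text: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let char_at = |i: usize| chars.get(i).map(|&(_, c)| c);

    let mut errors = Vec::new();
    for (i, &(offset, c)) in chars.iter().enumerate() {
        if c != ',' {
            continue;
        }
        let mut next = i + 1;
        if char_at(next) == Some('"') {
            next += 1;
        }
        match (char_at(next), char_at(next + 1)) {
            // A comma (plus optional quote) at the very end is left alone.
            (None, _) => {}
            // Exactly one space: correct.
            (Some(' '), following) if following != Some(' ') => {}
            // No space, or more than one.
            _ => errors.push(offset),
        }
    }
    errors
}
```
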
feat: Revise spellcheck suggestion ordering

The current spellcheck suggestions are sorted based on the edit distance to the provided word. This works, but there are a few practical issues.

While there are a lot of possible ways to improve this, I first want to try simply prioritizing longer words. Most spelling errors seem to be omissions, rather than additions or replacements.

Sometime down the line, we can prioritize suggestions based on Google's 1-gram data.
This would not require including the 1-gram data in Harper, only looking up the frequencies of the words that are already in our list.
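
One reading of "prioritizing longer words" is to keep edit distance as the primary key and use length as the tie-breaker. The sketch below takes that interpretation, which is an assumption, and expects the distances to have been computed already. `order_suggestions` is a hypothetical name.

```rust
/// Orders `(candidate, edit_distance)` pairs by ascending distance,
/// putting longer candidates first when distances are equal.
fn order_suggestions(mut candidates: Vec<(&str, usize)>) -> Vec<&str> {
    candidates.sort_by(|a, b| a.1.cmp(&b.1).then(b.0.len().cmp(&a.0.len())));
    candidates.into_iter().map(|(word, _)| word).collect()
}
```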

feat: identify emojis as separate token

This is very low priority. We would only want to specifically identify emojis if we were to create a lint around them... which doesn't sound incredibly useful.

`[lspconfig] Cannot access configuration for harper_ls.`

I am trying to add harper_ls to Neovim, but I am getting this error whenever I start Neovim:

[lspconfig] Cannot access configuration for harper_ls. Ensure this server is listed in `server_configurations.md` or added as a custom server.

Here is the relevant section of my config (using lazy.nvim) (-- ... means omitted code):

-- lsp-zero.lua
return {
    "VonHeikemen/lsp-zero.nvim",
    -- ...
    config = function()
        -- ...
        local lsp_conf = require("lspconfig")
        -- ...
        lsp_conf.harper_ls.setup({
            settings = {
                ["harper-ls"] = {
                    linters = {
                        spell_check = true,
                        spelled_numbers = false,
                        an_a = true,
                        sentence_capitalization = true,
                        unclosed_quotes = true,
                        wrong_quotes = false,
                        long_sentences = true,
                        repeated_words = true,
                        spaces = true,
                        matcher = true
                    }
                }
            }
        })
        -- ...
    end
}
