elijah-potter / harper

The Grammar Checker for Developers

License: Apache License 2.0

Rust 76.74% JavaScript 1.25% CSS 0.37% HTML 0.37% Svelte 6.86% TypeScript 12.95% Dockerfile 0.40% C++ 0.12% Lua 0.17% Ruby 0.14% Just 0.64%

harper's Introduction

Allow Me to Introduce Myself

I am currently a computer science student at the Colorado School of Mines, as well as a software developer at Tyler Technologies. When I'm not occupied by those things, I'm usually working on Harper, a grammar checker for developers.

Most of my writing ends up on my personal website.

In case you're interested, my other recent side projects include: The Thrax Programming Language, a "Hope O' Meter", a number of short videos and a software renderer I wrote from scratch.


harper's Issues

feat: `harper-ls` Should Provide Quickfixes

Exactly as the title says.

The LSP implementation should allow users to view and apply Harper suggestions as quickfixes.

bug: + not recognized as a valid word

When I use the + symbol for math equations inside a Markdown file, Harper suggests the following fixes for this input:

T(n) = T(n-1) + 1

Fixes:

1. Replace with: "0"
2. Replace with: "CD"
3. Replace with: "0"

feat: Revise spellcheck suggestion ordering

The current spellcheck suggestions are sorted based on the edit distance to the provided word. This works, but there are a few practical issues.

While there are a lot of possible ways to improve this, I first want to try simply prioritizing longer words. Most spelling errors seem to be omissions, rather than additions or replacements.

Sometime down the line, we can prioritize suggestions based on Google's 1-gram data.
This would not require including the 1-gram data in Harper, only looking up the frequencies of the words that are already in our list.
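
A minimal sketch of the "prioritize longer words" idea, not Harper's actual implementation. The names `edit_distance` and `rank_suggestions` are made up for illustration. It assumes candidates are ranked by edit distance first, with ties broken in favor of the longer word.

```rust
use std::cmp::Reverse;

/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // `prev[j]` holds the distance between the processed prefix of `a` and `b[..j]`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical ranking: smallest edit distance first, longer words first on ties.
fn rank_suggestions<'a>(misspelled: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    let mut ranked: Vec<&'a str> = candidates.to_vec();
    ranked.sort_by_key(|w| (edit_distance(misspelled, w), Reverse(w.chars().count())));
    ranked
}
```

With the input `helo` and the candidates `help`, `hello`, and `halo` (all one edit away), the longer `hello` moves to the front, which matches the guess that most misspellings are omissions.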

`[lspconfig] Cannot access configuration for harper_ls.`

I am trying to add harper_ls to Neovim, but I am getting this error whenever I start Neovim:

[lspconfig] Cannot access configuration for harper_ls. Ensure this server is listed in `server_configurations.md` or added as a custom server.

Here is the relevant section of my config (using lazy.nvim) (-- ... means omitted code):

-- lsp-zero.lua
return {
    "VonHeikemen/lsp-zero.nvim",
    -- ...
    config = function()
        -- ...
        local lsp_conf = require("lspconfig")
        -- ...
        lsp_conf.harper_ls.setup({
            settings = {
                ["harper-ls"] = {
                    linters = {
                        spell_check = true,
                        spelled_numbers = false,
                        an_a = true,
                        sentence_capitalization = true,
                        unclosed_quotes = true,
                        wrong_quotes = false,
                        long_sentences = true,
                        repeated_words = true,
                        spaces = true,
                        matcher = true
                    }
                }
            }
        })
        -- ...
    end
}

feat: identify emojis as separate token

This is very low priority. We would only want to specifically identify emojis if we were to create a lint around them... which doesn't sound incredibly useful.

feat: Create map between matching word sets.

These will be curated manually.

For example:

there fore -> therefore
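
One possible shape for such a map, as a sketch only. `build_phrase_map` and `suggest_replacement` are hypothetical names, and the single entry is the example from this issue.

```rust
use std::collections::HashMap;

/// Manually curated map from a mistaken phrase to its replacement.
fn build_phrase_map() -> HashMap<&'static str, &'static str> {
    HashMap::from([("there fore", "therefore")])
}

/// Looks up a phrase case-insensitively and returns the curated replacement, if any.
fn suggest_replacement(
    phrase: &str,
    map: &HashMap<&'static str, &'static str>,
) -> Option<&'static str> {
    map.get(phrase.to_lowercase().as_str()).copied()
}
```

A linter built on this would slide a window over adjacent word tokens and query the map for each window.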

feat: Should Parse Markdown Correctly

Currently, Harper only parses plain English properly.

This results in errors similar to the following, where the word "adore" throws an error even though it is spelled correctly.

I __adore__ this cupcake.

In order to parse Markdown properly, we need to generate tokens in the same way we generate tokens for plain English, while ignoring the Markdown additions.

I believe we can do this through the use of pulldown_cmark.

feat: detect missing spaces

If the document contains a large, misspelled "word," like nospaces, we should make a best guess at where to insert a space so that it becomes two valid words.
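
A rough sketch of the best-guess split, assuming the dictionary is available as a set of lowercase words. `split_into_two_words` is a made-up name, and the function simply returns the first split where both halves are known words.

```rust
use std::collections::HashSet;

/// Returns the first way to split `word` into two dictionary words, if one exists.
fn split_into_two_words<'a>(
    word: &'a str,
    dictionary: &HashSet<&str>,
) -> Option<(&'a str, &'a str)> {
    word.char_indices()
        .skip(1) // skip offset 0 so the left half is never empty
        .map(|(idx, _)| word.split_at(idx))
        .find(|(left, right)| dictionary.contains(*left) && dictionary.contains(*right))
}
```

If several splits are valid, a real lint would probably want to rank them rather than take the first one.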

bug: checking go directives

When using Harper on Go codebases, it would be nice to skip Go comments that start with //go:*, as these comments are directives for the compiler. The following is an example:

//go:embed templates/login.html
var LoginPageHTML string

The documentation on this Go feature is here
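
A small sketch of how such comments could be recognized, assuming the comment text is available as a string that still includes the leading slashes. `is_go_directive` is a hypothetical helper, not an existing Harper function.

```rust
/// Returns true if a Go line comment looks like a compiler directive such as `//go:embed`.
/// Directives have no space between `//` and `go:`, and the name starts with a lowercase letter.
fn is_go_directive(comment: &str) -> bool {
    comment
        .strip_prefix("//go:")
        .and_then(|rest| rest.chars().next())
        .map_or(false, |c| c.is_ascii_lowercase())
}
```

Comments for which this returns true would be skipped before linting.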

bug: `harper-ls` will not publish diagnostics except on save

See title. As soon as someone opens a document, the diagnostics should be shown.

feat: Anaphora checking

Not sure what we can do here. There are some situations where repetition of the same word at the beginning of a sentence is a literary device (anaphora), and others where it comes off as wordy and wrong.

How can we notify the user when they do the latter?

feat: Implement parsing of Hunspell dictionaries

Overview

There are several open source dictionaries available in the Hunspell *.dict and *.aff formats. Notably, there are a good many here.

Why?

Right now, the main problem with the spellchecker is the available word list.
The current one, english_words.txt, has too many words.
Not only that, but the word list also contains a lot of "words" that don't seem to be part of the standard English lexicon (e.g. "aarp").

By enabling Harper to use Hunspell dictionaries, we can lean on the existing curation.

The Formats

Source

`*.dict` File

The *.dict file is extremely similar in usage to our existing english_words.txt file.
The main difference is the addition of the / separated postfixes that provide additional information about each word.
These postfixes allow Hunspell to ship a relatively small word set, and expand it at runtime.

This file can technically act as a drop-in replacement for the existing word list, but certain words will be marked as invalid, since we wouldn't be processing the postfixes.
For example, "there" would be marked as valid, but "there's" would not.
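
To make the word-plus-postfix layout concrete, here is a minimal sketch of splitting one line into its stem and raw flags. `DictEntry` and `parse_dict_line` are hypothetical names. The sketch assumes the caller has already skipped any header line and does not interpret the flags, which is the job of the affix file.

```rust
/// One entry from the word list: the stem plus its raw, uninterpreted postfix flags.
#[derive(Debug, PartialEq)]
struct DictEntry<'a> {
    word: &'a str,
    flags: &'a str,
}

/// Splits a line such as `walk/DGS` at the first `/`. Lines without `/` have empty flags.
fn parse_dict_line(line: &str) -> Option<DictEntry<'_>> {
    let line = line.trim();
    if line.is_empty() {
        return None;
    }
    let (word, flags) = line.split_once('/').unwrap_or((line, ""));
    Some(DictEntry { word, flags })
}
```

The flag letters in `walk/DGS` are only an illustration. What they expand to depends entirely on the accompanying affix file.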

`*.aff` File

The affix file defines how the postfixes described above should be expanded.
Right now, we do not intend to support the entire *.aff file format, just enough to fit our needs with a specific dictionary. For example, we will ignore the encoding setting and assume all dictionaries are UTF-8.
We will also (at least initially) not support compounding.

diagnostics don't refresh on command submission

btw

bug: Harper lints inline math

Right now Harper throws a hissy fit every time inline math is used. Inline math should be ignored by Harper.

I've been meaning to fix this issue for quite some time now. When inspecting Markdown, Harper currently uses pulldown_cmark, which doesn't presently support math.

However, they recently merged support into the branch for the upcoming release. Before we can include the changes, we need that version of pulldown_cmark to be on crates.io.

The merged request

bug: table cells are not treated as end of sentence

Maybe you can classify it as not a bug, but my understanding of a cell in markdown is that the sentence is implicitly ended whether or not there is a period.

Example: harper-ls will complain that the bottom half of this table is a sentence which is too long

| Key        | Action                                                                 |
| ---------- | ---------------------------------------------------------------------- |
| `j`        | Scroll down                                                            |
| `k`        | Scroll up                                                              |
| `l`        | Scroll one page down                                                   |
| `h`        | Scroll one page up                                                     |
| `r`        | Reload file                                                            |
| `f` or `/` | Search                                                                 |
| `n` or `N` | Jump to next or previous search result                                 |
| `s` or `S` | Enter select link mode. Different selection strategy.                  |
| `Enter`    | Select. Depending on which mode it can: open file, select link, search |
| `Esc`      | Go back to _normal_ mode                                               |
| `t`        | Go back to files                                                       |
| `b`        | Go back to previous file (file tree if no previous file)               |
| `g`        | Go to top of file                                                      |
| `G`        | Go to bottom of the file                                               |
| `d`        | Go down half a page                                                    |
| `u`        | Go up half a page                                                      |
| `q`        | Quit the application                                                   |

bug: Sentences are not terminated by paragraph breaks

This is true of the broader sentence identifier. The most pressing issue is that the Markdown parser does not insert newlines at the end of headings, etc.

bug: All periods are considered sentence terminators.

Should be self-explanatory.

For example, the e.g. is flagged:

You can only delete an item if it is not being used by any other item in your workspace (e.g. as a subroutine).
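
One way this could be handled, sketched under the assumption that a short list of known abbreviations is enough. `ABBREVIATIONS` and `period_belongs_to_abbreviation` are made-up names, and the list is deliberately tiny.

```rust
/// Hypothetical list of abbreviations whose periods should not end a sentence.
const ABBREVIATIONS: [&str; 2] = ["e.g.", "i.e."];

/// Returns true if the period at byte offset `idx` is one of the periods
/// inside a known abbreviation.
fn period_belongs_to_abbreviation(text: &str, idx: usize) -> bool {
    ABBREVIATIONS.iter().any(|abbr| {
        // Try each period position within the abbreviation and check whether
        // the abbreviation lines up with the text at that alignment.
        abbr.match_indices('.').any(|(offset, _)| {
            idx >= offset
                && text
                    .get((idx - offset)..)
                    .map_or(false, |rest| rest.starts_with(*abbr))
        })
    })
}
```

The sentence splitter would then treat a period as a terminator only when this check returns false.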

feat: "a" vs "an"

As the title says. Harper should be able to check and provide suggestions to fix improper use of "a" vs "an" depending on the succeeding word.

bug: Overlapping lints in web UI mess with formatting

As the title says

"thread 'main' panicked at [...] when slicing

Hi! Thanks for this awesome project!

I got this panic in a Lua file when using Harper 0.6.2 as a language server:

[ERROR][2024-02-16 20:27:27] .../vim/lsp/rpc.lua:796	"rpc"	"/home/melker/.cargo/bin/harper-ls"	"stderr"	"thread 'main' panicked at /home/melker/.cargo/registry/src/index.crates.io-6f17d22bba15001f/harper-ls-0.6.2/src/tree_sitter_parser.rs:185:32:\nbegin <= end (23 <= 14) when slicing `-----------\n-- Noice --\n-----------\n`\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n"

Here's what the file looks like (stripped down, but still causing the error):

-----------
-- Noice --
-----------
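
I have not looked at what `tree_sitter_parser.rs` does at that line, so the following is not a fix. It is only a sketch of the general technique for avoiding this class of panic: use the checked `str::get` instead of indexing with a range. `safe_slice` is a hypothetical helper.

```rust
/// Returns the requested byte range, or `None` when the range is reversed,
/// out of bounds, or not on a character boundary (all cases where `&text[a..b]` panics).
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

With the file above, the reversed range `23..14` from the panic message yields `None` instead of aborting the server.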

bug: Tree-Sitter parser fails to join lines of comments

When a comment spans multiple lines, the current Tree-Sitter parser parses them as separate lines.

add to mason registry

Seeing that nvim-lspconfig already has a config for harper, it should also be added to mason as a common method for installing packages in nvim.

bug: parsing error on words containing non English characters

If I try to write a name of something containing non English characters, harper-ls tries to spellcheck the substring after the special character. I don't expect it to recognise names from other languages since it's meant for English, but it hinders me from adding the whole word to the global/local dictionary using my editor's quick fix.

Example:
I try to write the word Løvetann. harper-ls asks if I meant to spell vetann this way, and the quick fix option is to add vetann to the dictionary.

Requested behaviour:
I would like harper-ls to mark the whole word, and give me the option to add Løvetann to the global/local dictionary.
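
A sketch of the requested behaviour, assuming the cause is that word boundaries are decided by ASCII letters only (I have not checked the lexer). `word_spans` is a made-up function that treats any Unicode alphabetic character as part of a word.

```rust
/// Splits text into words, treating every Unicode alphabetic character as a word character.
fn word_spans(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|word| !word.is_empty())
        .collect()
}
```

With this rule, Løvetann stays a single word, so a quick fix could offer to add the whole word to the dictionary.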

feat(harper_ls): add code action to add misspelled word to document's dictionary

Exactly as the title says. I believe that dictionary additions should be document-specific. In other words, additions to the dictionary should not affect other documents.

If we want to be able to add to a global dictionary, that should be a separate code action.

License

Hi! This repo doesn't have a license, which means that it's technically not FOSS. If your intention is for it to be FOSS, consider adding a license :)

bug: `harper-ls` runs on code blocks in Markdown

Just as the title says. Code blocks should be ignored.

bug: Contractions are marked as a single token

Harper currently marks contractions as a single token, rather than three.

For example: you'll should be marked as you ' ll. Similarly, ain't should be marked as ain ' t, where ' is a punctuation token.
This is with the intention of running a special spellchecking linter for contractions.
They should not be handled by the generalized spellchecking linter.
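
To illustrate the three-token split, here is a small sketch. The `Token` enum and `lex_contraction` are hypothetical stand-ins and do not mirror Harper's real token types.

```rust
/// Simplified stand-in for a lexer token.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Word(&'a str),
    Apostrophe,
}

/// Splits a contraction into word, apostrophe, word. Anything else stays a single word.
fn lex_contraction(word: &str) -> Vec<Token<'_>> {
    match word.split_once('\'') {
        Some((head, tail)) if !head.is_empty() && !tail.is_empty() => {
            vec![Token::Word(head), Token::Apostrophe, Token::Word(tail)]
        }
        _ => vec![Token::Word(word)],
    }
}
```

A dedicated contraction linter could then match on the `Word, Apostrophe, Word` pattern.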

This is related to #6.

bug: When sentences start with inline code blocks, linters don't include them in sentences.

Which makes sense, since they aren't even included in the token list. I imagine we can fix this by creating a special ignore token and parsing unwanted blocks as these.

feat: "a" vs "an"

As the title says. Harper should be able to check and provide suggestions to fix improper use of "a" vs "an" depending on the succeeding word.

feat: URLs should be lexed as their own token

Similar to #25

bug: C++ multi line comments with asterisks aren't respected

For example:

  /***
   * This is an example of the error:
   * this line is considered a new sentence.
   */
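
A sketch of one way to normalize such a comment before linting, assuming the raw comment text (delimiters included) is available. `join_block_comment` is a made-up name. It drops the `/*` and `*/` delimiters and the leading asterisks, then joins the remaining lines with spaces.

```rust
/// Removes block-comment decoration and joins the content lines into one string.
fn join_block_comment(comment: &str) -> String {
    let body = comment.trim();
    let body = body.strip_prefix("/*").unwrap_or(body);
    let body = body.strip_suffix("*/").unwrap_or(body);
    body.lines()
        .map(|line| line.trim().trim_start_matches('*').trim())
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Note that this throws away the original byte offsets, so a real implementation would also need to map lint spans back to the source.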

FR: custom dictionary path

Right now, the location of the custom dictionary path is static. It would be useful to name a custom location, e.g. for syncing purposes.

bug: a/an choice determined by letter consonant not phonetic consonant in acronyms

Example: an LLM is valid, but Harper suggests a LLM because L is a consonant, even though phonetically it begins with a vowel.

Probably not fixable.
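
A partial heuristic might still help, although it cannot be correct in general. The sketch below assumes that an all-caps word is read letter by letter and chooses the article from the spoken name of its first letter. `VOWEL_SOUND_LETTERS` and `expected_article` are hypothetical names.

```rust
/// Letters whose spoken English names begin with a vowel sound (for example "el", "em", "ess").
const VOWEL_SOUND_LETTERS: &str = "AEFHILMNORSX";

/// Guesses "a" or "an" for the word that follows the article.
fn expected_article(next_word: &str) -> &'static str {
    let first = match next_word.chars().next() {
        Some(c) => c,
        None => return "a",
    };
    // Assumption: words written entirely in capitals are spelled out when read aloud.
    let is_initialism = next_word.chars().count() > 1
        && next_word.chars().all(|c| c.is_ascii_uppercase());
    let vowel_sound = if is_initialism {
        VOWEL_SOUND_LETTERS.contains(first)
    } else {
        "aeiou".contains(first.to_ascii_lowercase())
    };
    if vowel_sound { "an" } else { "a" }
}
```

This gets "an LLM" and "a USB" right, but it is wrong for acronyms pronounced as words (NASA) and for ordinary words like "hour" or "university", so it would need an exception list at the very least.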

Feature request: `diagnosticSeverity` support

Hi! I think it would be great if harper supported a user specified diagnosticSeverity 🙂

feat: spellcheck identifier declarations

As the title says.
Make sure you split up the component words of camel or snake case.
Might need to auto-detect the style of the file.
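
A sketch of the splitting step, written so that it handles camel case and snake case with the same rule and so sidesteps style detection for the simple cases. `split_identifier` is a made-up name.

```rust
/// Splits an identifier into lowercase component words.
/// Underscores and uppercase letters both start a new word.
fn split_identifier(ident: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for c in ident.chars() {
        if c == '_' {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
        } else {
            if c.is_uppercase() && !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            current.extend(c.to_lowercase());
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

Runs of capitals such as `HTTPServer` are split into single letters by this rule, so acronyms inside identifiers would need extra handling.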

Build from crates.io fails

Looks to be a semver problem in the deps?
(doesn't occur when building from git)

   Compiling harper-ls v0.8.1
error[E0308]: `match` arms have incompatible types
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/harper-ls-0.8.1/src/tree_sitter_parser.rs:30:24
   |
18 |           let language = match file_extension {
   |  ________________________-
19 | |             "rs" => tree_sitter_rust::language(),
20 | |             "tsx" => tree_sitter_typescript::language_tsx(),
21 | |             "ts" => tree_sitter_typescript::language_typescript(),
...  |
29 | |             "rb" => tree_sitter_ruby::language(),
   | |                     ---------------------------- this and all prior arms are found to be of type `tree_sitter::Language`
30 | |             "swift" => tree_sitter_swift::language(),
   | |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `tree_sitter::Language`, found a different `tree_sitter::Language`
...  |
34 | |             _ => return None
35 | |         };
   | |_________- `match` arms have incompatible types
   |
   = note: `tree_sitter::Language` and `tree_sitter::Language` have similar names, but are actually distinct types
note: `tree_sitter::Language` is defined in crate `tree_sitter`
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tree-sitter-0.22.1/binding_rust/lib.rs:55:1
   |
55 | pub struct Language(*const ffi::TSLanguage);
   | ^^^^^^^^^^^^^^^^^^^
note: `tree_sitter::Language` is defined in crate `tree_sitter`
  --> /home/jayvdb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tree-sitter-0.20.10/binding_rust/lib.rs:43:1
   |
43 | pub struct Language(*const ffi::TSLanguage);
   | ^^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `tree_sitter` are being used?

bug: Sentence parser does not include final quotation mark

The current sentence implementation does not consider that quotation marks can appear after the sentence terminator. We should consider this final quotation as part of the sentence.

For example:

She said, "There is no way this is true."
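
A sketch of the intended behaviour, not of the current sentence implementation. `first_sentence_end` is a hypothetical function that finds the first terminator and then absorbs any closing quotation marks that directly follow it.

```rust
/// Returns the byte offset just past the first sentence, including trailing quotation marks.
fn first_sentence_end(text: &str) -> Option<usize> {
    let (idx, terminator) = text
        .char_indices()
        .find(|&(_, c)| matches!(c, '.' | '!' | '?'))?;
    let mut end = idx + terminator.len_utf8();
    // Absorb quotation marks that sit directly after the terminator.
    for c in text[end..].chars() {
        if matches!(c, '"' | '\'' | '”' | '’') {
            end += c.len_utf8();
        } else {
            break;
        }
    }
    Some(end)
}
```

For the example above, the returned offset lands after the closing quotation mark rather than after the period.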

feat: Include spaces after commas

Include a checker that ensures there is exactly one space after every comma. Quotes complicate things, since spaces should come after the quote.
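
A rough sketch of such a checker, assuming straight double quotes and that a quote directly after a comma is a closing quote. `comma_spacing_errors` is a made-up name. It returns the byte offsets of the offending commas.

```rust
/// Returns byte offsets of commas that are not followed by exactly one space.
/// A double quote directly after the comma is skipped before counting spaces.
fn comma_spacing_errors(text: &str) -> Vec<usize> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut errors = Vec::new();
    for (i, &(offset, c)) in chars.iter().enumerate() {
        if c != ',' {
            continue;
        }
        let mut next = i + 1;
        if matches!(chars.get(next), Some(&(_, '"'))) {
            next += 1;
        }
        // A comma (or comma plus quote) at the very end of the text is not flagged.
        if next >= chars.len() {
            continue;
        }
        let spaces = chars[next..]
            .iter()
            .take_while(|&&(_, ch)| ch == ' ')
            .count();
        if spaces != 1 {
            errors.push(offset);
        }
    }
    errors
}
```

Both test cases in this issue are flagged by this sketch, while their corrected forms pass.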

Test cases:

hello world,my friend

"Hello,"my friend said.

feat: Spell check content of html tags?

Would be cool to be able to check the contents of html 😉

<p>this should be checked</p>

bug: improve cross-case spellcheck.

As of right now, these words are marked incorrect (which is valid), but no suggestions are provided due to improper handling of capitalization.

ymca
Ymca

feat: Detect and repair multiple sequential pronouns.

For example:

...little bit about my I want to do.

my and I can never be next to each other.

feat: Spell checker should expand search if no words are found

When the spell checker encounters especially long, incorrectly spelled words, it fails to provide any suggestions.

For example: algorithmically gives no suggestions. Algorithmically is now in the dictionary, but when it wasn't it was marked as incorrectly spelled.

This can be fixed by gradually expanding beyond the max_edit_dist until a word is found.
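
A sketch of the expanding search. `suggest_expanding` and `hard_limit` are hypothetical. The distance function is passed in as a parameter so the sketch does not depend on how Harper computes edit distance.

```rust
/// Tries `max_edit_dist` first, then widens the limit one step at a time
/// (up to `hard_limit`) until at least one candidate is found.
fn suggest_expanding<'a, F>(
    word: &str,
    dictionary: &[&'a str],
    max_edit_dist: usize,
    hard_limit: usize,
    distance: F,
) -> Vec<&'a str>
where
    F: Fn(&str, &str) -> usize,
{
    for limit in max_edit_dist..=hard_limit {
        let hits: Vec<&'a str> = dictionary
            .iter()
            .copied()
            .filter(|candidate| distance(word, *candidate) <= limit)
            .collect();
        if !hits.is_empty() {
            return hits;
        }
    }
    Vec::new()
}
```

The upper bound keeps the search from returning unrelated words when nothing in the dictionary is remotely close.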

feat: Capitalize Common Proper Nouns

Harper should detect and repair uncapitalized proper nouns and brand names.

Examples:

youtube -> YouTube
youTube -> YouTube
Youtube -> YouTube
china -> China
united States -> United States
United states -> United States
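
One simple way to get this behaviour is a lookup keyed by the lowercase form, sketched below. `proper_nouns` and `canonical_capitalization` are made-up names, and the list only contains the examples from this issue.

```rust
use std::collections::HashMap;

/// Builds a map from the lowercase form of each proper noun to its canonical spelling.
fn proper_nouns() -> HashMap<String, &'static str> {
    ["YouTube", "China", "United States"]
        .iter()
        .map(|&name| (name.to_lowercase(), name))
        .collect()
}

/// Returns the canonical spelling when `phrase` is a known proper noun with wrong capitalization.
fn canonical_capitalization(
    phrase: &str,
    known: &HashMap<String, &'static str>,
) -> Option<&'static str> {
    let canonical = *known.get(&phrase.to_lowercase())?;
    (canonical != phrase).then_some(canonical)
}
```

Multi-word names like "United States" mean the linter has to test windows of several tokens, not just single words.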

feat: ignore shebangs

Harper should not lint shebangs found at the start of files.

Examples include:

#! /usr/bin/ruby

#! /bin/bash
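
A minimal sketch of skipping the shebang line before linting. `strip_shebang` is a hypothetical helper.

```rust
/// Returns the source without its first line when that line is a shebang.
fn strip_shebang(source: &str) -> &str {
    if source.starts_with("#!") {
        match source.find('\n') {
            Some(idx) => &source[idx + 1..],
            None => "",
        }
    } else {
        source
    }
}
```

Since the returned slice starts later in the file, lint spans would need the length of the removed line added back to line up with the original document.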

feat: Check if capitalization does not match the dictionary for any part other than the first letter

Right now, Harper marks the following words as "OK", when they really shouldn't be.

NuMber

doOr
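
A sketch of the check, assuming the dictionary stores each word in its canonical form and that only three variants are acceptable: the form as stored, the form with its first letter capitalized, and all caps. `capitalization_matches` is a made-up name.

```rust
/// Returns true if `word` is the dictionary form, its capitalized form, or its all-caps form.
fn capitalization_matches(word: &str, dictionary_form: &str) -> bool {
    let mut chars = dictionary_form.chars();
    let capitalized: String = match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    };
    word == dictionary_form || word == capitalized || word == dictionary_form.to_uppercase()
}
```

Whether all caps should be accepted is a judgment call. It is allowed here so that headings and shouted text do not trigger the lint.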

LaTeX Support

feat: Detect repetition of common words

Harper should include a linting rule that detects and repairs incorrect repetition of common words.

Examples that should throw a lint:

She lifted the the rock.

I will will do it later.

Examples that should not throw a lint:

This is very very difficult.

While the above example could be grammatically improved by removing the repetition, it is not a grammatical error. This improvement should be a separate lint.
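
A sketch of the rule using a small allowlist of words whose repetition is almost always a mistake. `COMMON_WORDS` and `repeated_common_words` are hypothetical, and the list is only illustrative.

```rust
/// Hypothetical list of words that should never appear twice in a row.
const COMMON_WORDS: [&str; 4] = ["the", "will", "a", "to"];

/// Returns each common word that is immediately repeated in the sentence.
fn repeated_common_words(sentence: &str) -> Vec<String> {
    let words: Vec<String> = sentence
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .collect();
    words
        .windows(2)
        .filter(|pair| pair[0] == pair[1] && COMMON_WORDS.contains(&pair[0].as_str()))
        .map(|pair| pair[0].clone())
        .collect()
}
```

Because "very" is not on the list, "very very difficult" passes, which matches the examples above.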

feat: lex email addresses as own token

Harper should lex or parse items that match the email address schema as a separate token.

bug: Spell checker runs on number suffixes

The spell checker should not need to run on number suffixes.

For example:

Ideally, all of them will be completed before August 12th.

Currently, Harper flags the -th as an error.
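
A sketch of how such tokens could be recognized before spellchecking. `is_ordinal_number` is a made-up name, and the check does not verify that the suffix agrees with the number (it accepts "12st").

```rust
/// Returns true for tokens made of digits followed by an ordinal suffix, such as "12th".
fn is_ordinal_number(token: &str) -> bool {
    let digits_end = token
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(token.len());
    let (digits, suffix) = token.split_at(digits_end);
    !digits.is_empty() && matches!(suffix, "st" | "nd" | "rd" | "th")
}
```

Tokens for which this returns true would be skipped by the spell checker.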

bug: & not recognized as a valid word

Using the & character in a sentence should either be accepted or be replaced with "and".
