cloudcannon / pagefind

Static low-bandwidth search at scale

License: MIT License

Gherkin 15.56% Rust 72.64% JavaScript 4.86% Shell 0.23% HTML 0.16% CSS 1.55% Svelte 2.30% TypeScript 2.70%

pagefind's Introduction

Pagefind

Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users’ bandwidth as possible, and without hosting any infrastructure.

The full documentation on using Pagefind can be found at https://pagefind.app/.

pagefind's Issues

Words with accented letters

It appears that words with accented letters aren’t indexed. I searched for “Régis” as both “Régis” and “Regis” — neither returned “Régis.” Same for “Bjørn” — neither “Bjørn” nor “Bjorn” would return “Bjørn.” Just in case it was simply ignoring the accented letters, I also tried “Rgis” and “Bjrn” (respectively), and neither worked.
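
For illustration, here is a minimal sketch of diacritic folding, which is one common way to make "Regis" match "Régis". It is an assumption about how this could be handled, not a description of what Pagefind does internally; `foldDiacritics`, `foldTerm` and `EXTRA_FOLDS` are hypothetical names.

```javascript
// Hypothetical helper: reduce a term to a lowercase, diacritic-free form.
// Applying the same fold at index time and at query time lets "Regis" match "Régis".
function foldDiacritics(term) {
  return term
    .normalize("NFD")        // split "é" into "e" + U+0301 (combining acute accent)
    .replace(/\p{M}/gu, "")  // drop the combining marks
    .toLowerCase();
}

// Some letters (such as "ø") have no canonical decomposition, so NFD leaves
// them untouched. They need an explicit table; this one is illustrative and
// deliberately incomplete.
const EXTRA_FOLDS = new Map([["ø", "o"], ["đ", "d"], ["ł", "l"]]);

function foldTerm(term) {
  return Array.from(foldDiacritics(term), (ch) => EXTRA_FOLDS.get(ch) ?? ch).join("");
}

console.log(foldTerm("Régis")); // regis
console.log(foldTerm("Bjørn")); // bjorn
```

Note that "ø" is the kind of letter that survives plain Unicode normalization, which is why the extra table is there.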

Allow using custom Pagefind UI strings

Situations where HTML isn’t well-formed

This is more of an FYI than a true issue report.

In testing Pagefind with the Astro SSG, I’ve seen that it can’t index pages with malformed HTML even though browsers can manage to display the HTML normally. For example, Astro sometimes generates pages lacking the wrapping <html></html> (and Pagefind fails to index them, in my testing) even though the HTML does manage to have <body></body>. (Their team is aware of the issue, and I’ve seen similar such problems with Astro-generated HTML over the last few months. They seem to come and go.)

I’ve had no such problem with Hugo, where the HTML is always correct.

Haven’t tried Pagefind with Eleventy yet but, in two-and-a-half years of off-and-on use of Eleventy, I never saw it produce malformed HTML, so am guessing Pagefind would be fine with it, too.

Add base url configuration

On either the indexing or the client side, a baseurl should be able to be provided so that search results are prepended correctly.

Stretch concept:
By default, Pagefind could try to prepend a detected baseurl from the _pagefind folder location.
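
A rough sketch of both ideas, assuming result URLs are root-relative paths and the bundle lives in a `_pagefind` folder. The function names `withBaseUrl` and `detectBaseUrl` are hypothetical, not part of Pagefind.

```javascript
// Hypothetical helper: prepend a base URL to a result URL without doubling slashes.
function withBaseUrl(baseUrl, resultUrl) {
  const base = baseUrl.replace(/\/+$/, ""); // strip trailing slashes
  const path = resultUrl.replace(/^\/+/, ""); // strip leading slashes
  return `${base}/${path}`;
}

// Stretch concept: treat everything before "/_pagefind/" in the bundle URL
// as the base URL. Falls back to "/" when the folder is not found.
function detectBaseUrl(bundleUrl) {
  const idx = bundleUrl.indexOf("/_pagefind/");
  return idx === -1 ? "/" : bundleUrl.slice(0, idx) + "/";
}

const base = detectBaseUrl("/docs/_pagefind/pagefind.js");
console.log(base);                             // /docs/
console.log(withBaseUrl(base, "/blog/post/")); // /docs/blog/post/
```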

Improve contextual ranking

Pagefind ranks results based on term frequency in the document. This should be improved by also ranking on the distance between search terms, so that a search for "cat hat" returns "the cat in the hat" above "the cat wasn't particularly related to the hat".

(i.e. cat . . hat > cat . . . . . hat)
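
To make the idea concrete, here is a small sketch that measures the closest distance between two terms in a text, in word positions. It assumes whitespace tokenization and exact word matches; the function names are hypothetical and this is not Pagefind's ranking code.

```javascript
// Positions (word indexes) at which `term` occurs in `text`.
function wordPositions(text, term) {
  const words = text.toLowerCase().split(/\s+/);
  return words.flatMap((word, i) => (word === term ? [i] : []));
}

// Smallest gap between any occurrence of one term and any occurrence of the other.
// Returns Infinity when either term is missing.
function minDistance(positionsA, positionsB) {
  let best = Infinity;
  for (const a of positionsA) {
    for (const b of positionsB) {
      best = Math.min(best, Math.abs(a - b));
    }
  }
  return best;
}

function proximity(text, termA, termB) {
  return minDistance(wordPositions(text, termA), wordPositions(text, termB));
}

console.log(proximity("the cat in the hat", "cat", "hat")); // 3
console.log(proximity("the cat wasn't particularly related to the hat", "cat", "hat")); // 6
```

A ranking function could then prefer the document with the smaller distance when term frequency is otherwise similar.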

Allow mounting search field to existing input

In some cases a search field may already exist and be very carefully styled to match the site. In this case it would be great to have an option to tell the Pagefind UI to use the existing static input field, rather than the default behavior of dynamically creating a new one. This could work very well in conjunction with the #53 feature request.

Maybe something like:

new PagefindUI({
    "element-static": "#search"
});

when you want to mount to an existing static search field.

(PagefindUI) Ensure search results are in view on mobile

Hey!

We noticed that on mobile, most users can't see the search results when there are too many "Filter" categories, because the results are pushed below the viewport. So we've been experimenting with letting users opt in to viewing the filters, rather than having the filter options open each time they search for something.

For now, we've just added an event listener to the "#search" element bar that just hides the filter options, but thought this might be a good suggestion?

// Close the filter block each time the search box is updated
document.querySelector(".pagefind-ui__search-input").addEventListener("input", function () {
    const filterBlock = document.querySelector(".pagefind-ui__filter-block");
    const checkedValue = document.querySelector(".pagefind-ui__filter-value--checked");
    // Only collapse when the block is open and no filter option is checked
    // (a checked option means the user has opted in to filtering)
    if (filterBlock && filterBlock.hasAttribute("open") && checkedValue === null) {
        // Click the filter toggle to collapse it
        document.querySelector(".pagefind-ui__filter-name").click();
    }
});

Hide empty filters

Currently the filter in the search results shows many values with no results, displayed as "Value (0)". It would be more user-friendly if those items were hidden automatically.

HTML entities in search results

I have a prototype search feature for my website using pagefind, https://dotat.at/search.html. It's nice and whizzy, and it fits in well with my Rust static site generator. Thanks for making pagefind!

The only significant problem is that HTML entities in page titles are escaped, so my results page displays them like

 2022-04-20 &ndash; really divisionless random numbers

Entities in page bodies are not escaped, so if you search (for example) for nbsp, you get a lot of highlighted spaces in the results. This is probably a bug but it isn't a showstopper for me.

Hyphenated phrases

I tried searching for hyphenated words and phrases both with and without surrounding quotes — e.g., Go-based and "Go-based" — but this doesn’t work with 0.5.3. The result shows up with zero hits. I don't know whether the ignoring of hyphens is on purpose but, if not, just FYI.

Allow mounting results pane separately

Currently the search results pane (in the included Pagefind UI) is tied to the search input; it would be nice if it could be (optionally) placed separately.

Maybe something like:

new PagefindUI({
    element: "#search",
    results: "#search_results"
});

when you don't want the default current behavior.

Image not using available space in preview

Auto-scaling SVG images (without a defined size) do not take up the available space in the search result. In the example below you can clearly see that the first image (SVG) is much smaller than the second image (JPG), even though it could easily be shown bigger.

Screenshot taken from: https://pkic.org/blog/

Direct image link:
https://pkic.org/images/members/entrust/entrust.svg

Implement local development mode

When developing locally, Pagefind could watch and host a directory, so that a new SSG build would get picked up and indexed automatically.

Allow "Filters" in PagefindUI?

Hey!

I can't begin to say how helpful pagefind has been for our projects, incredibly grateful that this was made possible :)

My suggestion or question is really simple! Is there a way I can enable the "filters" in PagefindUI() options? I want to limit the search responses to the current section only, i.e. when a user searches "Events" they shouldn't be shown blogs.

Basically allowing us to extend pagefind options into the PagefindUI() configuration. Something like:

new PagefindUI({ 
            element: "#search",
            filters: {
                Type: ["Events"],
            }
        });

I'd love to know if this is possible!

Use relative URLs for images

The search results currently use an absolute URL for images (but not for links), causing more bytes to be transferred and requiring the site URL itself to be added to the img-src Content-Security-Policy when testing locally.

Avoid indexing html encoded characters as text

Currently some instances of quotes as " can be found via searching for quot — these entities should be normalized.
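
As a sketch of what normalizing could look like, the snippet below decodes numeric entities and a small set of named ones before text is indexed. The entity table is deliberately tiny and the name `decodeEntities` is hypothetical; a real implementation would rely on a full HTML parser instead.

```javascript
// A deliberately small table of named entities, for illustration only.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
  ["nbsp", "\u00a0"], ["ndash", "\u2013"], ["mdash", "\u2014"],
]);

// Replace &name;, &#123; and &#x7b; forms with the characters they stand for.
// Unknown or out-of-range entities are left untouched.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED_ENTITIES.get(body) ?? match;
  });
}

console.log(decodeEntities("say &quot;hi&quot;")); // say "hi"
```

With this applied before tokenizing, a search for quot would no longer match every encoded quotation mark.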

license?

what's the license of this project?

Change snowball implementation

The web library currently uses a Rust snowball implementation for stemming, but it looks like the JS implementation would come in much smaller and reduce the WASM filesize.

Spellcheck

Pagefind does not currently provide a spellcheck, to keep the network resource size down. Proposal:

Identify high-value unique words on the site
Add those words into the metadata index
Provide spellcheck amongst that set of results
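
The last step of the proposal could be sketched as follows, using Levenshtein edit distance against a small vocabulary. The vocabulary, the distance threshold, and the function names are all assumptions made for illustration.

```javascript
// Classic Levenshtein edit distance, computed row by row.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the closest vocabulary word within maxDistance edits, or null.
function suggest(term, vocabulary, maxDistance = 2) {
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const word of vocabulary) {
    const distance = editDistance(term, word);
    if (distance < bestDistance) {
      best = word;
      bestDistance = distance;
    }
  }
  return best;
}

// Hypothetical set of high-value words that would be shipped in the metadata index.
const vocabulary = ["pagefind", "static", "search"];
console.log(suggest("pagefnd", vocabulary)); // pagefind
```

Keeping the vocabulary to high-value words only is what keeps the network cost low in this scheme.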

Cannot install package

I'm building in my package.json like this:

  "scripts": {
    "all": "npm run build && npm run postbuild",
    "build": "npx @11ty/eleventy",
    "postbuild": "npx pagefind --source _site"
  },

and then npm run all --serve

I'm doing this locally, in dev mode, at the moment, and here's the output:

> [email protected] all
> npm run build && npm run postbuild


> [email protected] build
> npx @11ty/eleventy

Writing _site/... etc etc
Copied 841 files / Wrote 72 files in 25.28 seconds (351.1ms each, v0.12.1)

> [email protected] postbuild
> npx pagefind --source _site

Need to install the following packages:
  pagefind
Ok to proceed? (y) y

> [email protected] postbuild
> npx pagefind --source _site

And it stops there. I don't see any Pagefind output about pages being indexed or files being created, and there is no /_pagefind/ directory in my _site.

I do have the Pagefind search UI in my search page:

          <div id="search"></div>
          <script>
              window.addEventListener('DOMContentLoaded', (event) => {
                  new PagefindUI({ element: "#search",
                  showImages: false,
                  });
              });
          </script>

and I just omitted <link href="/_pagefind/pagefind-ui.css" rel="stylesheet">;
while <script src="/_pagefind/pagefind-ui.js"></script> is in my 11ty _includes layout with other JS that work fine.

I've also tried running npx -y pagefind --source _site --serve separately in Windows PowerShell, but with no success.

What am I missing? Thanks!

Support for fully offline search like PWAs

Pagefind is a great project and a boon to static/JAM stack websites.

I run a blog that is also a Progressive Web Application (PWA), and I thought that since Pagefind is fully client-side it would work offline for a PWA. Unfortunately, that was not the case; the main blocker I found was the metadata timestamp call. I have made sure that all the .pf_fragment, .pf_index, .pagefind, and .pf_meta files are cached by the service worker, but when the server is not reachable, the request for /_pagefind/pagefind.pf_meta?ts={current-ts} fails and the search fails too. Is there a way for this request to also be served from the service worker cache so that the search works offline? Maybe not with the latest index, but with whatever was cached. I may be oversimplifying it.

Patch fix

If I force remove the ?ts={current-ts} it kinda works fully offline, is there a way to remove it with a flag or something?
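
Until such a flag exists, one possible workaround on the service worker side is to ignore the timestamp when looking up cached responses. The sketch below shows only the URL normalization step; `cacheKeyFor` is a hypothetical helper, and it assumes `ts` is the only cache-busting parameter. The Cache API's `match()` also accepts an `ignoreSearch: true` option, which ignores the whole query string.

```javascript
// Hypothetical helper: drop the cache-busting "ts" parameter so that
// /_pagefind/pagefind.pf_meta?ts=123 and ?ts=456 share one cache entry.
function cacheKeyFor(requestUrl) {
  const url = new URL(requestUrl);
  url.searchParams.delete("ts");
  return url.toString();
}

console.log(cacheKeyFor("https://example.com/_pagefind/pagefind.pf_meta?ts=1660000000"));
// https://example.com/_pagefind/pagefind.pf_meta
```

A fetch handler could use this key for cache lookups and only fall back to the network when nothing is cached.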

Handle complex bcp-47 language codes automatically

For example, zh-Hans-tw needs to be matched to our zh-tw UI translations.
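
A simplified sketch of such a fallback: try the full tag, then language plus region, then the bare language. It skips script subtags and does not handle extension or private-use subtags, so treat it as an assumption-laden illustration; the function names are hypothetical.

```javascript
// Build a list of progressively less specific codes for a language tag.
function candidateCodes(tag) {
  const parts = tag.toLowerCase().split("-");
  const language = parts[0];
  // Region subtags are two letters or three digits; four-letter script
  // subtags such as "hans" do not match and are therefore skipped.
  const region = parts.slice(1).find((part) => /^([a-z]{2}|[0-9]{3})$/.test(part));
  const candidates = [parts.join("-")];
  if (region) candidates.push(`${language}-${region}`);
  candidates.push(language);
  return candidates;
}

// Pick the first candidate for which a UI translation exists, or null.
function resolveTranslation(tag, available) {
  const known = new Set(available.map((code) => code.toLowerCase()));
  return candidateCodes(tag).find((code) => known.has(code)) ?? null;
}

console.log(resolveTranslation("zh-Hans-tw", ["en", "zh-cn", "zh-tw"])); // zh-tw
```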

ReferenceError: url is not defined with Content-Security-Policy enabled

I'm trying to enable pagefind with Content-Security-Policy enabled and run into the following error:

pagefind.js:1 
  Uncaught (in promise) ReferenceError: url is not defined
    at Pagefind.loadWasm (pagefind.js:1:12922)
    at async Promise.all (/blog/index 0)
    at async Pagefind.init (pagefind.js:1:12207)
loadWasm @ pagefind.js:1
await in loadWasm (async)
Pagefind @ pagefind.js:1
(anonymous) @ pagefind.js:1

Hugo config:

server:
  headers:
  - for: /**
    values:
      X-Frame-Options: DENY
      X-Content-Type-Options: nosniff
      Referrer-Policy: strict-origin-when-cross-origin
      Permissions-Policy: document-domain=()
      Content-Security-Policy: default-src 'none'; img-src 'self' data:; form-action 'self'; base-uri 'self'; 
          block-all-mixed-content;
          style-src 'unsafe-inline' 'self' https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css https://fonts.googleapis.com/css;
          font-src https://fonts.gstatic.com/s/roboto/;
          script-src 'self' https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js;
          frame-src https://www.youtube-nocookie.com/ https://player.vimeo.com/;
          media-src https://i.ytimg.com https://www.rovid.nl/def/dco/2016/def-dco-20160823-idoa9bivg-web-hd.mp4;
          connect-src ws://localhost:1313/livereload 'self';

The connect-src option of the Content-Security-Policy is set to self which permits the script to connect (without this option you would get a policy error).

** copied the /public/_pagefind directory in my static Hugo folder for testing

This error is also live on: https://pkic.org/blog/

When running without Content-Security-Policy using the built-in --serve option, the search runs fine:

hugo; ../pagefind --source ./public/  --serve

Return no results if a non-existent filter is provided

Currently filters will be ignored if no index exists for that value.

Windows binary release

It would be great to have a Windows binary release for when updating Hugo pages on a Windows desktop.

Customize the indexed search element

Pagefind currently forcibly indexes the body selector. Content can be cut out using the data-pagefind-ignore attribute, but ideally one could provide a different selector to index from, which would save having to ignore navigations and footers, if the content is reliably located inside an <article>, for example.

Allow for fully-qualified baseURLs in Pagefind search options

See discussion in #49

Feature request: Add support for misspellings

Thank you for creating this open search solution.

Currently, misspelling a search term doesn't return any result. Please add the ability to show results for misspelled words.

Relates to #2

Implement OR filter

Pagefind supports filtering with multiple values:

let search = await pagefind.search("Cat", {
  filters: {
    color: ["Tabby", "Orange"]
  }
});

But currently only implements an AND filter, i.e. cats that are Tabby and Orange. This boolean logic should be expanded.

Once indexes are loaded, searching is extremely fast, so this could be achieved by running a Tabby search followed by an Orange search and smooshing the results together.
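
The "smooshing" step might look something like the sketch below. It assumes each result is an object with an `id` and a numeric `score`, which is a guess at the result shape rather than a documented one, and `mergeResults` is a hypothetical helper.

```javascript
// Merge several result lists into one, keeping a single entry per id
// (the highest-scoring one) and sorting by score, best first.
function mergeResults(...resultLists) {
  const byId = new Map();
  for (const list of resultLists) {
    for (const result of list) {
      const seen = byId.get(result.id);
      if (!seen || result.score > seen.score) {
        byId.set(result.id, result);
      }
    }
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

// Stand-ins for the results of a "Tabby" search and an "Orange" search.
const tabby = [{ id: "a", score: 2 }, { id: "b", score: 1 }];
const orange = [{ id: "b", score: 3 }, { id: "c", score: 0.5 }];
console.log(mergeResults(tabby, orange).map((r) => r.id)); // [ 'b', 'a', 'c' ]
```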

CSP and “Load more results” button

As the docs counsel, I set my Content Security Policy’s script-src to unsafe-eval; however, that turned out to be insufficient. (I have put the CSP in Report-Only mode for now.) I also had to use unsafe-inline. That worked fine for the initial five loads but, when I used the Load more results button, I got this back:

[Report Only] Refused to send form data to 'javascript:void(0);' because it violates the following Content Security Policy directive: "form-action 'self' https://duckduckgo.com https://*.duckduckgo.com https://startpage.com https://*.startpage.com".

Short of just not having a form-action entry in the CSP, am not sure how to proceed on that one. My searches don’t seem to give a good answer.

This may be somewhat similar to Issue #21, but am just noting it FYI.

Site: https://www.brycewray.com
Repo: https://github.com/brycewray/hugo_site

P.S. All this aside, very nice job on the fixes between 0.4.1 and 0.5.0! 👍 👍 👍

Search performance on very large sites

Hey guys, I've been checking out Pagefind and it works pretty great! I integrated it with my tiny website and it works pretty much flawlessly. The installation was super easy and the UI is slick, fast and lightweight. Lots of fun.

I went ahead and did some testing with a bigger site. I grabbed the largest static website that came to mind, which is the entirety of the Unity manual. It's actually an excellent comparison case because the docs (both the online and offline version) also use client-side search (that needs to grab an enormous ~10MB index).

The bundle is generated very quickly given how many pages there are to index.

$ C:\src\UnityDocumentation\node_modules\.bin\pagefind --source en
Running Pagefind v0.6.1
Running from: "C:\\src\\UnityDocumentation"
Source:       "en"
Bundle Directory:  "_pagefind"
Walking source directory...
Building search indexes...
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.
Indexed 29977 pages
Indexed 99142 words
Indexed 0 filters
Created 225 index chunks
Finished in 41.730 seconds
Done in 42.01s.

I deployed a Pagefind version of the docs search here on GitHub Pages and played around with it a little. The amount of data that Pagefind needs to transfer is much smaller, so the core concept definitely works great. However, there are a couple of common cases where the search engine freezes up quite heavily.

For example, here's a Firefox profile of a search for the letter "a". The search engine seems to freeze up for a couple of seconds (on a pretty fast desktop machine). Obviously, a lot of search queries will start with that, so it's a bit of an issue. Other degenerate queries include an, t the, c, t, time, etc - anything that's short enough that it will generate lots of hits. (Again, none of this impacts me currently at all, but it could impact someone eventually...)

Same thing if the user types a special character like space, dot or comma - the engine seems to generate tens of thousands of hits for that. I don't think it's very useful that a query like this generates any hits at all, although maybe this is more of a pagefind-ui issue, not sure.

Finally, I wanted to link the index back to the original documentation website at docs.unity3d.com. Sadly, I was unable to set this up. As I understand, this could be eventually solved by #17, although that seems like a solution to a much bigger problem than mine - If I understand correctly, all I really needed is being able to separate the bundlePath from the URL appended to the hit result link. This could probably be achieved by modifying pagefind-ui, but it would be nice to have a built-in way to handle this.

Take a look at the test page if you want to play with it: https://apkd.github.io/pagefind-benchmark

Thanks!

Implement configuration file

Pagefind currently only supports commandline flags, but should move to support configuration files and environment variables as well.

Ignore images within `data-pagefind-ignore`

Currently there is no way to exclude images from being selected for the image metadata. This is mainly a problem when the content has no related image, in which case an image from the site design is selected instead.

I tried to exclude these images by putting them in a data-pagefind-ignore container expecting those images to be ignored but without success.

As an alternative, I'm overriding the image metadata with the automatically generated Open Graph image, but this prevents any other image from within the content from being selected.

<meta property="og:image" content="{{- $img.Permalink -}}" data-pagefind-meta="image[content]"/>

Support for dark and light mode?

Love this search so far, really easy to integrate! <3

But I didn't find anything for the support of more than one color mode?
Right now it works well for light mode, but the text color is barely readable in dark mode.

So it would be awesome to have support for dark and light mode :)

Cumulative Layout Shift Issues

The example code:

<link href="/_pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/_pagefind/pagefind-ui.js" type="text/javascript"></script>

<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search" });
    });
</script>

Loads the pagefind search bar after the main HTML content has been loaded, which has impacted the CLS score of a project I'm testing this out on. Is there a way to add the generated HTML that pagefind inserts into the search div manually, then allow the PagefindUI class to use that?

Minify JavaScript and CSS

While pagefind.js is minified, the size of the other assets can be further reduced.

Multiple Site Indexing and Cross-site search.

We have 3 different domains built with different SSG tools. We would like to be able to search on any of the sites and return the best results from any of the others. It would be nice if there was a way for the client-side JS lib to have more than one index file to consider when finding hits.

Feature request: don't index stop words

Stop words are frequently occurring, insignificant words. By not indexing stop words we would be able to provide a more precise result list. This is a common approach in search tools. It would be good to also have this be configurable. As an extension of this feature, it would be great if the stop words also handled multilingual sites. I think this feature would also improve the search performance as a side effect?
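
A minimal sketch of per-language stop word filtering. The word lists are tiny placeholders, and the names `STOP_WORDS` and `indexableWords` are hypothetical.

```javascript
// Placeholder stop word lists keyed by language code. Real lists are much longer.
const STOP_WORDS = new Map([
  ["en", new Set(["a", "an", "the", "and", "of", "to", "in"])],
  ["fr", new Set(["le", "la", "les", "et", "de", "un", "une"])],
]);

// Split text into lowercase words and drop the stop words for the given language.
// Unknown languages fall back to no filtering.
function indexableWords(text, language) {
  const stops = STOP_WORDS.get(language) ?? new Set();
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((word) => word !== "" && !stops.has(word));
}

console.log(indexableWords("The cat in the hat", "en")); // [ 'cat', 'hat' ]
```

Making the lists configurable would amount to letting the site supply its own entries for this map.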

Delete current search term from search field

Feature request: Would it be possible to have an option to easily delete the current search term from the search field? Maybe a "x" on the far right that deletes the current search term if clicked?

Support files other than *.html

It would be helpful (for me at least) if pagefind were able to index files other than those named *.html.

For long and complicated (20+ year old) reasons the pages built by my static site generator have a '.shtml' extension and pagefind ignores them by default. I've hacked this up locally, but it would be great if you could optionally choose an extension or ideally glob pattern.

Otherwise what a brilliant piece of software!

Implement exact phrase matching

Pagefind has the data required to provide an "exact phrase" search functionality a-la-Goog, where the term is provided in quotes.

Expose a list of filters

Currently the search user must know what filters can be supplied. The Pagefind object should provide a way to get the filters that it has indexed.

i.e.

const pagefind = await import("/_pagefind/pagefind.js");
const filters = await pagefind.filters();

showEmptyFilters not working

showEmptyFilters does not hide empty filter values with the following config:

window.addEventListener('DOMContentLoaded', (event) => {
    new PagefindUI({ 
        element: "#search",  
        baseUrl: "/",
        showEmptyFilters: false
    });
});

Example:
https://pkic.org/blog/

Return information about the search request

Pagefind should return some information alongside the search results:

Suggestion for a better search term
Information about partial matches, i.e. two of three words were found

Internationalization

Pagefind is currently using the English stemmer — this should be made configurable and ideally auto-detected.

If an internationalized website in folders is detected, Pagefind should run itself once for each language directory, using that language as the configuration.

Doesn't work with mod_mime_magic

If mod_mime_magic is enabled on an Apache server (it's enabled by default on RedHat based systems) then the pf_meta file fails to load.

Looking deeper into this, it appears as if the magic module is detecting the files are already gzip'd and so added a Content-Encoding: x-gzip header to the response. This appears to tell the browser to automatically gunzip the data because gzs() is being passed the uncompressed data. This, naturally then fails the "is this compressed" test and causes the load to fail.

The simple solution is to disable this module on the server (which may be worth adding to the documentation). But it might also be worth having some way of detecting whether the browser has automatically uncompressed the data before the JavaScript gets to see it (e.g. add a simple 3-byte signature to each file; if that is seen instead of the gzip header, then return the raw datastream).
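
The signature idea could be sketched like this. The 3-byte marker is made up for illustration, `gunzip` stands for whatever decompression routine the library already uses, and `readIndexFile` is a hypothetical name. The only fixed fact here is that gzip streams start with the bytes 0x1f 0x8b.

```javascript
const GZIP_MAGIC = [0x1f, 0x8b];
const SIGNATURE = [0x70, 0x66, 0x31]; // "pf1", a made-up marker for this sketch

function startsWith(bytes, prefix) {
  return bytes.length >= prefix.length && prefix.every((byte, i) => bytes[i] === byte);
}

// Return the payload whether or not something (such as the browser reacting
// to a Content-Encoding header) has already decompressed the file.
function readIndexFile(bytes, gunzip) {
  if (startsWith(bytes, SIGNATURE)) {
    return bytes.subarray(SIGNATURE.length); // already decompressed upstream
  }
  if (startsWith(bytes, GZIP_MAGIC)) {
    const inflated = gunzip(bytes);
    if (!startsWith(inflated, SIGNATURE)) {
      throw new Error("Decompressed data is missing the signature");
    }
    return inflated.subarray(SIGNATURE.length);
  }
  throw new Error("Unrecognised index file");
}

// Simulate a response the browser has already gunzipped.
const payload = readIndexFile(new Uint8Array([0x70, 0x66, 0x31, 1, 2]), () => {
  throw new Error("gunzip should not be called");
});
console.log(Array.from(payload)); // [ 1, 2 ]
```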

Add custom weighting functionality

This could involve:

Weighting certain HTML elements above others (i.e. h1 > p)
Providing custom "boosts" to certain pages for certain terms
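
Both ideas can be combined in a simple score, sketched below. The weights, the boost parameter, and the name `weightedScore` are illustrative assumptions, not a proposal for specific values.

```javascript
// Illustrative weights: a match inside a heading counts for more than one in body text.
const ELEMENT_WEIGHTS = new Map([["h1", 5], ["h2", 3], ["p", 1]]);

// `occurrences` lists the element each match was found in, e.g. ["h1", "p", "p"].
// `boost` is an optional per-page multiplier for the searched term.
function weightedScore(occurrences, boost = 1) {
  const base = occurrences.reduce(
    (sum, element) => sum + (ELEMENT_WEIGHTS.get(element) ?? 1),
    0
  );
  return base * boost;
}

console.log(weightedScore(["h1", "p"]));     // 6
console.log(weightedScore(["p", "p"], 2));   // 4
```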

Avoid returning all results for search terms containing only special characters

As discussed in #49
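
One possible guard, sketched under the assumption that a term is only worth searching if it contains at least one letter or digit. `hasSearchableCharacters` is a hypothetical helper.

```javascript
// True when the term contains at least one Unicode letter or number.
function hasSearchableCharacters(term) {
  return /[\p{L}\p{N}]/u.test(term);
}

console.log(hasSearchableCharacters(" "));   // false
console.log(hasSearchableCharacters(".,"));  // false
console.log(hasSearchableCharacters("C++")); // true
```

A search function could return an empty result set straight away whenever this check fails.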

Feature request: ignore certain HTML elements and/or item IDs

In Hugo, the goldmark parser renders a footnote reference (i.e., the <sup>’d footnote number itself within text, as opposed to the actual resulting footnote below the text) as, e.g., <sup id="fnref:1">...</sup>. Since neither Hugo nor goldmark makes it possible to edit the HTML of footnote references* — and, thus, there’s no way to specify data-pagefind-ignore for them — it would be nice if one could make Pagefind ignore certain HTML elements and/or item IDs (in the latter case, perhaps any ID that begins with fnref).

[If one wished to exclude the actual footnotes, that also wouldn’t be available unless Pagefind also had the ability to exclude certain CSS classes (<div class="footnotes" role="doc-endnotes">), so I’m also mentioning that FYI.]

* I searched the Hugo Discourse extensively concerning this item, and it appears it’s a long-standing request (at least 2016) deemed as undoable due to issues relating to both goldmark and Commonmark itself.

Build Pagefind UI

Pagefind should provide a prebuilt UI component that can be dropped into a site.