
haskell-indexer's Introduction

The haskell-indexer package provides libraries for preprocessing Haskell source code into a representation suited for easy entity cross-referencing, as well as a frontend for emitting those entities in the Kythe indexing schema.

This is not an official Google product.

Supported systems

Indexing hosts:

  • Linux: supported - follow the documentation below.
  • Windows, macOS: untested - the backend part likely compiles, but the wiring and the Kythe frontend likely do not (see #38).

Compilers:

  • GHC 8.6.5
  • GHC 8.8.1 (planned)

Stackage:

  • A recent LTS release corresponding to the above compilers is supported. See the stack-ghcXXX.yaml files.

Previous compilers were supported at some point. Check out an old repository state if you are interested:


Installation

Stack

Download Stack from http://docs.haskellstack.org

Kythe

If you want to use the Kythe frontend, you'll need to install it either from source or from the official release. The latter is easier, but the web UI has been removed in recent versions.

Official Release

Download a Kythe release and unpack it.

tar xzf kythe-v0.0.26.tar.gz -C /opt/
rm -rf /opt/kythe
ln -s /opt/kythe-v0.0.26 /opt/kythe
chmod -R 755 /opt/kythe/web/ui  # The web UI files lack read permissions by default.

Version v0.0.30 is the latest version that includes the web UI. If you want a newer Kythe than this, you'll need to build from source.

If you install Kythe in a location other than /opt/kythe, you should also set KYTHE_DIR to the location of the installation.
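
For example, if the release is unpacked under $HOME/opt instead (the paths here are illustrative, not prescribed by the project):

```shell
# Hypothetical alternative install location; adjust the paths to your setup.
tar xzf kythe-v0.0.26.tar.gz -C "$HOME/opt/"
export KYTHE_DIR="$HOME/opt/kythe-v0.0.26"
```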

Building From Source

Clone Kythe from its GitHub repo and follow the Getting Started guide to build and install it into /opt/kythe. Then, from within the Kythe clone, build the web frontend and copy its files into their rightful place:

bazel build //kythe/web/ui
mkdir -p /opt/kythe/web/ui
cp -r bazel-bin/kythe/web/ui/resources/public/* /opt/kythe/web/ui
cp -r kythe/web/ui/resources/public/* /opt/kythe/web/ui
chmod -R 755 /opt/kythe/web/ui

Protoc 3

Download the latest Proto compiler 3 release, unpack it, and place the binary on your PATH.

unzip -j protoc-*-linux-x86_64.zip bin/protoc -d /usr/local/bin/

If you have Nix installed and use stack --nix, you do not need to do this.

Haskell Indexer Plugin (ghc >= 8.6 only)

Haskell modules can be indexed with a GHC source plugin while building a project. Whatever build system is in use, indexing can be achieved by ensuring that the invocations to ghc include the flags that enable the plugin.

For instructions on how to install and use the plugin with stack, see stack-example/README.md.

If you are using another build system, the following GHC options are relevant once the plugin is installed.

  • -package-db <db_path>: Tells GHC which package database the plugin has been installed in. It may be given more than once if the plugin's dependencies are spread across multiple package databases.
  • -plugin-package haskell-indexer-plugin: Tells ghc to expose the package containing the plugin, so it can be found when needed.
  • -fplugin Haskell.Indexer.Plugin: Tells GHC to use the plugin when compiling modules.
  • -fplugin-opt Haskell.Indexer.Plugin:-o and -fplugin-opt Haskell.Indexer.Plugin:<output_path>: Tell the plugin where to place the output of indexing.
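
Putting the flags together, a direct GHC invocation might look like the following sketch (the package-db path and the output path are placeholders; substitute your own):

```shell
# Hypothetical paths; the plugin must already be installed in the given package db.
ghc --make Main.hs \
    -package-db /path/to/plugin/package.db \
    -plugin-package haskell-indexer-plugin \
    -fplugin Haskell.Indexer.Plugin \
    -fplugin-opt Haskell.Indexer.Plugin:-o \
    -fplugin-opt Haskell.Indexer.Plugin:/tmp/indexer-output
```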

Build the project

Use the following to build and run tests:

git clone --recursive https://github.com/google/haskell-indexer.git
cd haskell-indexer
export STACK_YAML=$(readlink -f stack-ghc865.yaml)
stack build && stack test
# To test Kythe frontend:
pushd kythe-verification; stack install && ./test.sh; popd

To test all supported stack configurations, run ./run-ghc-tests.sh.

Demo

To index a few packages, run:

export INDEXER_OUTPUT_DIR=/tmp/indexer-output
./build-stack.sh mtlparse cpu

The script installs a wrapper around the GHC compiler used by Stack (stack path --compiler-exe); the wrapper performs the indexing whenever ghc --make is invoked to build a package. You can run build-stack.sh multiple times.
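
The pass-through-or-index decision can be sketched roughly like this (illustrative logic only; the real wrapper lives under wrappers/stack/ghc):

```shell
#!/bin/sh
# Sketch: index when GHC is asked to build a package via --make,
# otherwise just forward to the real compiler (REAL_GHC is a placeholder).
REAL_GHC=/path/to/real/ghc
case " $* " in
  *" --make "*) echo "would index, then run: $REAL_GHC $*" ;;
  *)            echo "would pass through: $REAL_GHC $*" ;;
esac
```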

To serve the index at http://localhost:8080:

./serve.sh localhost:8080

If you get an empty index, check the $INDEXER_OUTPUT_DIR/*.stderr files for possible indexing errors. Also make sure that the *.entries files are not empty; if they are, ghc_kythe_wrapper failed to index.
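
Assuming the demo setup above, a quick way to spot both failure modes (the commands are a sketch, not part of the project's tooling):

```shell
# Show stderr logs that contain anything, i.e. possible indexing errors.
grep -l . "$INDEXER_OUTPUT_DIR"/*.stderr
# List empty .entries files, which indicate a failed ghc_kythe_wrapper run.
find "$INDEXER_OUTPUT_DIR" -name '*.entries' -size 0
```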

Indexing using Docker

If you plan to use the Dockerized build feature of stack, please install Docker. It is also advised to set up a docker wrapper script by following the instructions at the stack Docker security section.

The Docker image has all the C library dependencies, so it can be used to index the whole Stackage snapshot. See stack-build-docker.sh for a comprehensive example of indexing a Stackage snapshot and serving a Kythe index.

haskell-indexer's People

Contributors

awpr, blackgnezdo, ddrone, facundominguez, gleber, ivan444, jinwoo, kristoff3r, maskray, mpickering, robinp


haskell-indexer's Issues

Improve error messages

Currently, if the API fails, it just throws an ExitFailure 1 exception. Instead, a detailed error message should be printed.

Index haskell through a GHC source plugin

Currently, haskell-indexer needs to be integrated with each build system individually. It has to concern itself with loading the code via the GHC API with the appropriate flags so the code can be indexed.

This issue is about indexing from a source plugin instead.

This would remove from haskell-indexer the burden of dealing with the GHC API to parse the code and to load code dependencies. The user would only need to arrange -fplugin Language.Haskell.Indexer to be passed to all invocations of GHC during a regular build of whatever build system she has in place. This should remove most of the build configuration problems that could make haskell-indexer fail, and would probably remove the need to use and maintain ghc wrappers.

Note that there was a former issue (#55) for implementing a frontend plugin, which is something different.

I'm not very acquainted with kythe or haskell-indexer yet, but I'm curious whether the source plugin idea is as good as it sounds from the outside.

Warts when (re)indexing packages with Stack.

...contrary to ghc-pkg unregister. Spotted by @MaskRay:

Sometimes stack will copy precompiled packages instead of rebuilding (configure/build/copy/register), therefore ghc_kythe_wrapper is not invoked.

% ./build-stack.sh /tmp/logs mtl
unregistering would break the following packages: proto-lens-combinators-0.1.0.7 proto-lens-protoc-0.2.1.0 proto-lens-descriptors-0.2.1.0 proto-lens-0.2.1.0 conduit-1.2.11 resourcet-1.1.9 parsec-3.1.11 lens-family-1.2.1 mmorph-1.0.9 kan-extensions-5.0.2 adjunctions-4.3 free-4.12.4 exceptions-0.8.3 haskell-indexer-pipeline-ghckythe-0.1.0.0 haskell-indexer-frontend-kythe-0.1.0.0 kythe-schema-0.1.0.0 kythe-proto-0.1.0.0 haskell-indexer-backend-ghc-0.1.0.0 (ignoring)
mtl-2.2.1: using precompiled package
# After deleting ~/.stack/precompiled/x86_64-linux{,-tinfo6}/ghc-8.0.2/1.24.2.0/mtl-2.2.1/*
% ./build-stack.sh /tmp/logs mtl
unregistering would break the following packages: proto-lens-combinators-0.1.0.7 proto-lens-protoc-0.2.1.0 proto-lens-descriptors-0.2.1.0 proto-lens-0.2.1.0 conduit-1.2.11 resourcet-1.1.9 parsec-3.1.11 lens-family-1.2.1 mmorph-1.0.9 kan-extensions-5.0.2 adjunctions-4.3 free-4.12.4 exceptions-0.8.3 haskell-indexer-pipeline-ghckythe-0.1.0.0 haskell-indexer-frontend-kythe-0.1.0.0 kythe-schema-0.1.0.0 kythe-proto-0.1.0.0 haskell-indexer-backend-ghc-0.1.0.0 (ignoring)
mtl-2.2.1: configure
mtl-2.2.1: build
mtl-2.2.1: copy/register
# Though ~/.stack/precompiled/x86_64-linux/ghc-8.0.2/1.24.2.0/mtl-2.2.1/ has been regenerated, further `./build-stack.sh /tmp/logs mtl` does not reuse precompiled one, weird. 

https://github.com/commercialhaskell/stack/blob/master/src/Stack/Build/Execute.hs#L1141

Figure out how to deal with (re)exports

It would be nice to mark all the stuff the module exports (even if those are defined in another module and exported via Haskell's re-export functionality).

That would enable us to make Kythe-assisted auto-import tool.

Could make a Kythe semantic entity (abuse kind=interface?) to represent "module exports", add anchors childof that entity, with the anchors referencing the exported semantic nodes. +@creachadair

build-stack.sh didn't create any output for me

I ran:

$ ./build-stack.sh tmp-logs mtlparse cpu
cpu-0.1.2: configure
cpu-0.1.2: build
mtlparse-0.1.4.0: configure
mtlparse-0.1.4.0: build
cpu-0.1.2: copy/register
mtlparse-0.1.4.0: copy/register
Completed 2 action(s).
$ cat tmp-logs/.log 
========= FAKE GHC =======
 == pwd: /home/niklas/src/haskell/haskell-indexer
== Passing through..
/raid/stack/programs/x86_64-linux/ghc-8.0.2/bin/ghc --info
========= FAKE GHC =======
 == pwd: /home/niklas/src/haskell/haskell-indexer
== Passing through..
/raid/stack/programs/x86_64-linux/ghc-8.0.2/bin/ghc --numeric-version

Hackage Release

It would be useful to release the packages on hackage. For my purposes, it would mean that I can stop locally packing everything on nix and I can start to push nix support upstream.

Exports should reference the exported entities

Entities in the export list should ref/exports the actual entities.
At simplest, reexports could also ref the actual entities.

Later if needed we could add a special imported at edge that would point to the import anchor the reexport is coming from. But possibly this is premature thinking - what would be a UI usecase that needs this feature?

For reference, see #33 where import indexing was done.

GHC 8 support

The GHC backend currently supports the GHC 7.10 AST.

The code should be adapted so it keeps supporting that, as well as the GHC 8 AST.

Note: while there, backwards support for the GHC 7.8 AST would be very easy, since there was only a slight change.

Export child relations

Export child relations for datatypes/constructors, classes/methods, functions/arguments, and so on.
This is crucial for generating a code outline.

Haddock comments support

The GHC backend needs to fetch the Haddock comments, and put them into the common Translate layer (maybe a bit transformed?), associating them to the correct entity. Then the Kythe frontend should format and emit these.

The Haddocks can be fetched using the Haddock API (has to be opened up, see haskell/haddock#595) based on the GHC AST we already have access to.

Note: a special arg needs to be passed to GHC to have the Doc nodes present in the AST. This is likely Opt_Haddock, can be passed in GhcApiSupport somehow. See this GHC test for inspiration.

Emit proper types

Now the type support in Translate layer is pretty weak (strings), and non-existent in Kythe frontend.

We should expose a stripped-down type representation to the Translate layer, approximately able to represent forall a (b :: k) . (Ctx a, Foo a b) => [...] -> [...] -> [...].

Kythe schema mapping (work-in-progress)

Constrained function.

Foo a b => a -> b:

  • abs Avar Bvar
  • child: tapp constr#1fn#2 vname(b) vname(Foo) vname(a).

Note that Kythe schema convention expects return type to be first parameter.

Also, Avar and Bvar are new absvars bound by the (implicit) foralls. So generally two functions foo :: a -> a and bar :: a -> a won't have the same Kythe type vname, since the absvars will differ. We gave some thought to this, and realized that full proper type-level querying likely needs a separate index, so we won't stress to fit all the abstract details into the Kythe schema.

We choose to fake constraints as additional parameters until Kythe has better support for them. Note that this can result in things having different types depending on the constraint tuple order. But since polymorphic things will generally have separate Kythe type vnames anyway, this is not a big loss.

Add reference consistency checks

For currently indexed module (easy): somehow report a list of missing decls, that were targeted by refs.

Globally: less trivial. Maybe given a set of packages, only report missing decls among these packages?

Can Kythe do any of these for us in some way (maybe in the verifier mode?). @creachadair

Implement Cabal based wrapper

The current extractor uses Stack, which not all Haskell projects use. Implement a Cabal-based extractor to fill the gap.

Remove ToolOverrides.

We already have C preprocessor (pgmP) override option, but not C compiler (pgmc) yet. The override is useful if the actual path of the command differs from the values of the GHC settings file.

Emit pre-splicing references when using TH

Now $(foo) is indexed with the unspliced content, but the reference to foo itself doesn't get emitted. This causes, for example, backreferences on foo not to show the usage sites.

Related:

  • Q expressions bar = [|$(foo)|].
  • QuasiQuoter name [bar|1+1].

Can we access the pre-splicing info from some of the ASTs?

Export Haskell packages and modules

Take (Cabal package) + (Haskell module) = Kythe package
Emit to index:

  • defines/binding from module name to package node
  • childof from top-level decls to module node

Note: import refs to the module are in separate issue, see below.

Run as GHC Frontend Plugin

GHC implements too much administrative magic in ghc/Main.hs to replicate it all in GhcApiSupport. We should restructure the GHC backend to be a frontend plugin instead, which gets (almost all) of the magic for free. See https://ghc.haskell.org/trac/ghc/ticket/14018#ticket.

Notably, the frontend plugin might still need to decide if it's running in Make or Oneshot mode, and act accordingly. For Make mode this means invoking compileFile on non-Haskell sources and sticking the resulting objects to dflags (see GHC's doMake function). No idea for Oneshot.

All this fuss is needed for 1:1 behavior with real GHC invocations. 95% of the time this is not needed, since machine-code generation is not required for indexing most code.

One exception is TemplateHaskell that runs an imported function (in which case at least bytecode generation is needed), or runs an imported FFI'd function (in which case C compilation, machine-code generation, and linking are needed too). Another exception is modules that use FFI in particular ways (maybe foreign exports?) which were not content with just generating bytecode. Also, having optimization turned on is not compatible with bytecode generation (but is fine with no codegen or machine codegen). Etc.

+@mpickering FYI.

Folding sift into haskell-indexer

Hi,

We're currently reviewing haskell-indexer for use on auditing work. I wrote a tool called sift which is capable of generating a simple cross-package call graph of a haskell package and writing it to a .json file. Example:

$ sift trace sift-bindings/*/* --flag-binding "ghc-prim GHC.Prim raise#" --call-trace | head -n 30
Flagged binding: ghc-prim:GHC.Prim.raise#
  Used by aeson:Data.Aeson.Encoding.Builder.day
  Call trace:
  aeson:Data.Aeson.Encoding.Builder.day
  |
  +- base:GHC.Real.quotRem
  |  |
  |  +- base:GHC.Real.divZeroError
  |  |  |
  |  |  `- ghc-prim:GHC.Prim.raise#
  |  |
  |  `- base:GHC.Real.overflowError
  |
  `- base:GHC.Err.error
  
  Used by aeson:Data.Aeson.Encoding.Builder.digit
  Call trace:
  aeson:Data.Aeson.Encoding.Builder.digit
  |
  `- base:GHC.Char.chr
     |
     `- base:GHC.Err.errorWithoutStackTrace
        |
        `- base:GHC.Err.error
           |
           `- ghc-prim:GHC.Prim.raise#

Preferably, we'd like to use haskell-indexer to achieve the same thing instead of maintaining two codebases.

I think I can get the same info from TickReference and Tick; XRef gives me a list of TickReferences and also a list of Relations.

I think from there I can produce a graph with Data.Graph by producing a list [(nodeid,node,[nodeid])] where the latter list is "my dependencies", like I do here

-- | Graph all package bindings.
graphBindings ::
     Set Binding
  -> OrdGraph BindingId Binding
graphBindings bs =
  ordGraph (map
              (\binding -> (binding, bindingId binding, bindingRefs binding))
              (Set.toList bs))

then I can produce a simple call graph like this:

callTrace :: OrdGraph BindingId node -> Graph.Vertex -> Graph.Vertex -> [Tree [Char]]
callTrace g start flagged =
  fmap
    (fmap
       (\v' ->
          let (_, bid', _) = ordGraphVertexToNode g v'
           in S8.unpack (prettyBindingId bid')))
    (filterForest
       (flip (Graph.path (ordGraphGraph g)) flagged)
       (Graph.dfs (ordGraphGraph g) [start]))

So I'm 90% confident I can fairly readily get the information I need to obsolete the sift tool. I have some questions:

  1. Do you have any intention of penetrating class instances so that we could, for example, determine that throw# is used by read "x" :: Int because the method instance for Int uses error? We need this for auditing, aside from it being a super cool feature in general.
  2. It seems that there isn't support for base? What's your approach on that? We need this for auditing. Here's how sift tackles the tooling issues:
    • sift-frontend-plugin is used when building GHC's lib dir, on the base package specifically. You just inject --frontend in the right place after building sift-frontend-plugin in the same package set. That lets you generate a profile of base. If you had trouble with this on haskell-indexer, maybe I can help out getting that to work.
    • sift-compiler - this is a copy/paste of the commandline interface from intero and ghci, and on start it immediately generates a profile of all the modules. This lets you do stack ghci --with-ghc sift-compiler and then you're done. I believe cabal repl --with-ghc sift-compiler would also work, but I haven't tested it. Why not just stack ghci --with-ghc ghci --ghci-options '--frontend Sift.FrontendPlugin'? Because GHC rejects frontend plugins when used with --interactive. 🤷‍♂️
  3. Finally, would you accept a PR that would contain some code for the above code tracking? It'd be nice to fold my code into haskell-indexer, and maybe have a UI that could display a call graph interactively.

Emit ref/call from Kythe frontend.

At least make this configurable; in some situations it's useful to have the ref/call edge present.

This is also a long-standing problem: Haskell (or partial application and first-class functions in general) doesn't have a good notion for distinguishing a reference from a call.

Support cross-language linking with C

See also #15, but here the path forward is less clear, as Haskell doesn't need an explicit SWIGging phase and can just live with foreign imports. See https://groups.google.com/forum/#!topic/kythe-dev/Fb7HmZffRtw for a related discussion.

The core of the matter is that when emitting references to the C code, we need to use the VName emitted by the C indexer, but that is unpredictable (we need side-channel info from the C indexer somehow).

The matter is further complicated by hsc2hs, inline-c, and maybe other glue libs. Proper research is needed. Maybe we don't need to support all the edge cases, or these cases are actually separate problems to solve on their own.

Make scripts kythe location agnostic

Now we hardcode /opt/kythe; instead we could expect an environment variable. Kythe uses KYTHE_ROOT_DIRECTORY and KYTHE_OUTPUT_DIRECTORY in its scripts.
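
A location-agnostic script could handle the variable along these lines (a minimal sketch; the fallback default is an assumption, not existing behavior):

```shell
# Fall back to /opt/kythe when KYTHE_DIR is not set in the environment.
KYTHE_DIR="${KYTHE_DIR:-/opt/kythe}"
echo "$KYTHE_DIR"
```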

Extraction: where to put generated or setup code

Spotted by @MaskRay:

Sometimes $PWD is not something like ...../mtl-2.2.1 ($package-$version)

Currently the Stack GHC wrapper script prepends the package name to the generated filepaths (see wrappers/stack/ghc), and also assumes that the files passed in to the compiler command are relative to the current dir (which is usually the case, thus the removal of the ./ prefix in the script).

But maybe sometimes the paths are absolute (like pointing to /home/user/.stack/setup-exe-src/....hs or /tmp/setup-xyz/Setup.lhs).

Now we seem to stash these files as relative paths under the package dir. That's not too bad: at least they are bundled with the related package. But if these files were shared/reused between compiling packages, the duplication would be awkward (does that ever happen?).

Just mentioning that the Kythe frontend at least provides the notion of a 'root', which we don't use now. Using roots, one could segregate (or namespace) paths, which is useful if independent paths might collide inside a single corpus, or if generated files should be explicitly separated.

I'm not sure it's our task to solve this "once and for all" (forall?), but we could provide a path-rewrite-rules facility and let the various build systems set up the rewrite rules as appropriate for them.

Deduplicate indexer output

Use LRU or bloom filters. Kythe dedupes anyway, but when we start emitting types there will be a lot of unnecessary content to process.

Support crossreference for imported modules and entities

As final result, the named entries in the import list should be properly hyperlinked.

The module names, like Foo in import Foo (bar, baz), should ref/imports the module node (see #8 ). Also, bar and baz should ref/imports too.

Dotted imports

import Foo (Bar(..)) - should we emit multiple ref/imports edges from the ..s? Or just do nothing there.

Aliased imports

What about aliased imports (regardless of qualification)? For example:

import Data.Text as T
...
foo = T.append "x" "y"

The easiest option is to do nothing. In any case, T.append will still refer to the same node. If we emitted a defines/binding on the T of the import line, then the T of T.append might ref that. But there's no really suitable node in the Kythe schema (unless we shoehorn talias + aliases), and the benefit is dubious. Also, the defines/binding would need to come from an implicit anchor, since multiple modules can be aliased to the same name (so the T in the import line would also be a reference).

Windows & Mac support.

The installation instructions are Linux-specific. The docs should state whether it works on Windows, or that it hasn't been tested, etc.

Indexing packages coming with GHC

I was trying to index bytestring and found that it can't be done with haskell-indexer. It also looks like it breaks the package db by unregistering bytestring, so I needed to run stack setup --reinstall to fix the situation. This seems related to @chrisdone's question about base in #79, but base is not enough: the dependencies that ship with GHC need special treatment too. I see your scripts in #56, @robinp, but they don't seem to cover the libraries that are git submodules of the GHC repo; why didn't those get indexed? BTW, shouldn't those scripts also be included in this repo, along with a file describing how to index GHC itself? (At the moment it's not clear to me where that /opt/ghc/bin/ghc comes from; probably a custom install of GHC into /opt/ghc?)

ghcWithIndexer nix packaging

Nix provides a very convenient wrapper, ghcWithHoogle, which generates Hoogle indexes for all dependencies of your project. It would be similarly convenient if there were a ghcWithIndexer wrapper which instead built Kythe indexes using haskell-indexer.

I intend to implement this, but this ticket is for the contingency that I don't.

Emit reference to instance method instead class method

When the instance is fully resolved.

Basic plan (probably doesn't handle edge cases etc):

  • see if HsVar's id's idDetails is ClassOpId (also isClassOpId_maybe)
  • if so, get the classTyCon from the Class of the ClassOpId
  • look if the HsWrap wrapping the HsVar contains WpEvApp (EvId x), where x's type is of form TyConApp <the classTyCon> ...
  • if we found it, then x is the dFunId... but how do we go from that (knowing the Class and the method name) to the instance method var?

Idea: get the module of the dFunId name using nameModule, then call GHC.getModuleInfo on it. Then we can look up the ClsInst list in that module... but it doesn't refer to the actual members, just the DFunId again. OK, not the best idea. Can we take that DFunId apart?

Idea#2: assign Tick to instance methods by concatenating the DFunId tick with the method name, so it's trivial to construct the Tick reference from the locally available info. Sounds like a much better idea.

This can work - but the instance method decls need to be taken from the typechecked tree, where we can harvest the related dFunId (the abe_mono is easier to find the classTyCon, since there's no context) by looking at which $c... gets applied in the $d... binding.

Index post-processed sources

Brought up by @mpickering. A few things to sort out for that question:

  • Does GHC provide convenient access to the post-processed ASTs, ideally with post-processed spans?

  • If not, are the post-processed sources accessible and can we do an extra compilation for them to get the spans?

  • How would we deduplicate definitions/references that are present in both the original and post-processed sources? (Sidenote: IIRC GhcAnalyser drops references that originated from generated code, but not sure if TH falls under that condition).

  • Where would we place post-processed code (this is a valid question for CPP too)? Kythe supports virtual roots, and generally we can emit whatever code fragments we want anywhere in the tree, but it would have to be thought up what reference/generates/... edges would be present.

  • Would we emit full postprocessed sources (more problematic duplication-wise), or do some smart thing to just put the TH-generated source fragments in virtual files?

+@creachadair: does for example the Kythe C++ indexer emit virtual fragments for un-CPP-d code? Do you have any takeaways from earlier attempts on this topic?

Document how to map GHC AST node to Kythe schema

How to map a small part of GHC AST to Kythe schema. A small tutorial.

  • where to look for GHC AST documentation (haddock, github.com/edsko/ghc-dump-tree, maybe some useful article from https://ghc.haskell.org/trac/ghc/wiki)
  • where to look for Kythe schema (http://kythe.io/docs/schema/)
  • what code needs to be changed (extending Translate.hs data types, mapping from GHC AST to Translate.hs DS in Ghc.hs, adding a new type in Typed.hs, doing the mapping in Kythe.hs). If possible, add a reference to a git branch/push/commit/whatever is kept visible in github with the complete code example of adding a mapping.
  • writing unit tests for the mapping (a small example)
  • writing kythe verification tests (what they are, where to put them, how to run them, where to get more documentation)

Support cross-language linking for Protobufs

  • Crosslink proto-lens stuff with protocol buffers.
  • Figure out how to do this schema-wise.
  • Protocol buffers might be a good start since Kythe has a nice documentation about them.

Kythe extractor support

Now the ghckythe-wrapper can be used in place of the ghc command to emit artifacts, which is fine for local, non-isolated indexing.

The more proper way would be to add extractors for the build systems (here Stack, Cabal build, Cabal new-build, ...) that save all the required inputs in index packs, so the separate indexing phase can happen exclusively based on the hermetic index pack data.

This would make it possible to do reproducible and/or distributed indexing.

The main tasks are:

1. Identifying (per build system) how to track down all the arguments and resource dependencies needed for the build.

This is not always trivial, as the deps need to capture auto-generated inputs etc too.

There's also the question of system-level dependencies (like global shared libraries) - should these be assumed omni-present on both the extractor and the indexer machines? Should they be added to the index pack?

2. Generic support for operations (reading/writing) index packs.
3. Making the haskell-indexer work out of the index pack.

How should one use the index pack's content?

The naive solution is to unpack the deps needed by the CompilationUnit to some local place and work from there. Attention has to be paid that the build is isolated and GHC doesn't pick up unexpected dependencies.

The more desired solution is to make the indexer (and so GHC) pull the dependencies on-demand from the index pack (instead of prefetching and extracting). This has the benefit that in case of over-eager extractors (that include more resources in the pack than strictly needed) it's still only the needed data that's pulled.

kythe-proto relies on unbundled storage.proto

The build for kythe-proto needs access to Kythe's storage.proto file. It achieves this by symlinking to ../../../../../third_party/kythe/kythe/proto/storage.proto. Would it be better to just vendor this specific file? The symlink makes building the package in isolation much more difficult, and I had to add special packaging logic when I packaged it for nix.

Add backend-ghc tests for exotic TH cases.

For example, tests where:

  • an object file (used for TH FFI) is put in args.
  • module has foreign exports and TH too.
  • an other package is referenced on the args (through custom package db)
  • TH is executed by using exports from other package which does FFI

Kythe verifier Import test uses hardcoded package

Fails with GHC 7.10:

export STACK_YAML=$(readlink -f stack-6.30.yaml)
stack install
cd kythe_verifier
./test.sh
Verifying: testdata/basic/ImportsRef.hs
Could not verify all goals. The furthest we reached was:
  testdata/basic/ImportsRef.hs:5:6-5:89 @"Data.Set" ref/imports vname("containers-0.5.7.1:Data.Set", "", "", "", "haskell")

In the broader context, we might want to make the verifier tests depend on the GHC version. Maybe run the tests from a light Haskell wrapper (instead of a shell script), where we can ifdef which tests to run. That would also make CI integration of those tests easier (they could move into the ghc-kythe package).

cc @ivan444

Evaluate indexer performance

Profile to see where we spend time.

One suspect is the uniplate traversals. We could try the version from Data.Data.Lens, which caches the possible paths and could speed the traversals up.
