bitfunnel / bitfunnel Goto Github PK

A signature-based search engine

License: MIT License

CMake 0.69% Batchfile 0.04% Shell 0.43% C++ 89.59% Go 0.05% R 0.07% Makefile 0.34% M4 0.33% C 0.26% Python 6.09% PowerShell 0.06% Assembly 2.03%

search-engine search-in-text search

bitfunnel's Introduction

BitFunnel

This repo contains the code for the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search, a paper presented at SIGIR 2017. This video gives a good overview of the algorithm.

The codebase here was published to allow the research community to replicate results from the SIGIR paper. The documentation is pretty thin, but we encourage you to look at the following :

Dependencies

In order to build BitFunnel you will need CMake (2.8.11+), and a modern C++ compiler (gcc 5+, clang 3.5+, or VC 2015+). You can run CMake directly to generate the appropriate build setup for your platform. Alternately, we have some scripts that have the defaults that we use available.

*nix

For *nix platforms (including OS X),

./Configure_Make.sh
cd build-make
make
make test

Note that while these instructions are for a make build, it's also possible to build using ninja by changing the cmake command to create ninja files instead of Makefiles. These aren't listed in the instructions because ninja requires installing an extra dependency for some developers, but if you want to use ninja it's available via apt-get, brew, etc., and is susbtantially faster than make.

Ubuntu

If you're on Ubuntu 15+, you can install dependencies with:

sudo apt-get install clang cmake

On Ubuntu 14 and below, you'll need to install a newer version of CMake. To install a new-enough CMake, see this link. If you're using gcc, you'll also need to make sure you have gcc-5 (sudo apt-get install g++-5).

To override the default compiler, set the CXX and CC environment variables. For example, if you have clang-3.8 installed as clang-3.8 and are using bash:

export CXX="clang++-3.8"
export CC="clang-3.8"

OS X

Install XCode and then run the following command to install required packages using Homebrew (http://brew.sh/):

brew install cmake

BitFunnel can be built on OS X using either standard *nix makefiles or XCode. In order to generate and build makefiles, in the root BitFunnel directory run:

If you want to create an Xcode project instead of using Makefiles, run:

./Configure_XCode.sh

If you use XCode, you'll have to either re-run Configure_XCode or run the ZERO_CHECK target when the CMakeLists changes, e.g., when source files are added or removed.

Windows

You will need these tools:

CMake (http://www.cmake.org/download/)
Visual Studio 2017 with C++ compiler (the free version)

Note: If you install Visual Studio for the first time and select the default install options, you won't get a C++ compiler. To force the install of the C++ compiler, you need to either create a new C++ project or open an existing C++ project.

Clone the BitFunnel repository and then run the following command in BitFunnel's root folder:

.\Configure_MSVC.bat

Note: You will need to modify the CMake -G option if you use a different version of Visual Studio. Bitfunnel must be built as a 64-bit program, so 'Win64' must be part of the specified G option text.

At this point, you can open the generated solution BitFunnel_CMake.sln from Visual Studio and then build it. Alternatively, you can build from the command line using cmake --build build-MSVC.

bitfunnel's People

Contributors

Stargazers

Watchers

bitfunnel's Issues

README.md for CmdLineParser

Determine if we need query boosting

Some mechanism with the same goal is obviously needed if we have 1T or maybe even 1B documents. It's not clear if any of our open source users will have anywhere near that many documents, and query boosting is both botled in on a way that makes it easy to not include, and contains a lot of complexity.

Set up Appveyor CI

Evaluate whether we need more unit tests.

The following Utilities classes are missing unit tests as of d3fc186:

ConsoleLogger
LockGuard
Logging
LogLevel
MurmurHash2
Mutex
NullLogger
Stopwatch
ThreadManager (tested indirectly by TaskDistributor)
ThreadsafeCounter (tested indirectly by TaskDistributor)

README.md for CsvTsv

Use of volatile

The old BitFunnel code has some uses of volatile that are safe on VC++ but may be unsafe on other compilers. In particular, on older versions of VC++, volatile acted as a memory barrier. On newer versions, the situation seems more complicated?

And clang and gcc don't have the same guarantees. I've removed uses of volatile from inside the Token* code, and I'm presently doing the same for the SimpleHash code.

The old codebase also contains volatile in:
Band*
ThreadsafeFlag
ThreadsafeState
ShardUnitTest
RecyclerUnitTest

I believe we shouldn't need Band*, ThreadsafeFlag, or ThreadsafeState, but that still leaves a couple of places where we might need to excise volatile.

Why did CmdLineParser use GetModuleFileNameA?

To get the porting started, I'm going to change this to use argv[0] instead of making some kind of multi-platform shim for both Linux and OS X, but there may be some reason that didn't work.

Talked with @MikeHopcroft, who didn't remember the reason for using GetModuleFileNameA and I don't know if we have anyone else who would know.

Run experiment on cacheline padding

DocInfo gets padded out to a nice size to avoid having hits span two cachelines. However, on Haswell, spanning a cacheline isn't that expensive in the general case unless you span a page boundary, and I believe I've heard that on some newer chips even spanning a page boundary isn't that bad (ofc. you still pay for bringing the page in).

This seems pretty simple to test experimentally once we have a reasonable benchmarking setup. Getting a reasonable benchmarking setup without pulling in a bunch of stuff we don't want to pull in may be non-trivial.

How compact is our data representation?

Just out of curiosity, if we grab main memory and run it through some compression algorithm, will it compress? Some architectural design decisions assume that the data is basically as compressed as is possible. Is that right?

Utilities should build and pass tests on linux and osx.

Remove C-isms?

We have a lot of C-isms in the older BitFunnel code. From talking to @MikeHopcroft , there seems to be no particular reason for this expect that this code is old and was written by people coming in with a C background.

I suspect this bug is so low priority that we'll never get to it, but it would be nice to clean some of this up.

Update Mike's squirreled away README with more detailed build instructions

I'm going to start filing issues on us here so we don't lose track of the TODOs we discussed recently. I'm not really a fan of github issues, but it seems good enough for now and we can always migrate later if we have a large enough volume of issues that it's a problem.

Some things that I suspect aren't in the current README are that, on Windows

you shouldn't install cmake from chocolatey because it's broken
if you have a fresh Visual Studio community installed with "default" options, cmake won't work because VS doesn't actually install a C++ compiler until you either create or open a C++ project.
You'll probably need to adjust the Configure_Windows_MSVC.bat script to point to the correct VS version (better yet, we should either auto detect that or just have it point to the most recent version by default).

Consider removing BitFunnel/src

The /src directory is not as relevant now that UnitTest is buried down in BitFunnel/src/project/UnitTest.The only peer to src is now inc.

With src removed, the root would look something like

Bitfunnel
    inc
        BitFunnel
    Common
        CmdLineParser
            src
            UnitTest
        CsvTsv
            src
            UnitTest
        Utilities
            src
            UnitTest

Memory allocation for users without dedicated machines

We should probably have some way to handle this for users that don't want to decide in advance how much memory they want to allocate to slices because, for example, they're running a single instance on a machine that's also running other software.

Will everything be ok if we just hook up jemalloc/tcmalloc/dlmalloc? Probably, but we should check to make sure this doesn't degrade performance in a surprising way or cause fatal errors due to fragmentation.

Figure out public header file targets in CMake

How should we treat header files in CMakeLists.txt? Are they listed explicitly? Which ones? Only public files? All of them? Rationale?

The rationale we have today is that it groups them in a folder in the build if you're on Windows (inside the .sln).

Determine whether or not we need TermTable/BandTable training

It's probably not reasonable to ship the 1.6GB that we get as output from training. It's probably also not reasonble to expect that most (any?) open source users have enough training data to be able to effectively train something similar.

It seems like we should be able to get away with something much simpler for anyone without, say, 1B documents, though.

Why is this-> needed in CsvTsv's Table.h

In the first version of table.h, line 737, Dan had to change
m_data.push_back(GetValue());
to
m_data.push_back(this->GetValue());
in order to get it to compile under GCC. We should understand why this change was actually necessary.

Replace MurmurHash?

There are now other hash functions that are allegedly substantially faster which also have nice randomness properties. When we get a proper benchmarking setup, we should see how much time we spend hashing and consider replacing our hash.

Some things we might want to look into include FarmHash and SipHash.

Considering how small the things we're hashing are, we should also consider less "fancy" hash functions, which have a good chance at being faster on really small chunks of data. We could even consider using the crc32 intrinsic, which has single-cycle throughput.

Remove assert / assert.h

We have a few places where we're temporarily using assert because LogAssert hasn't been ported to Linux yet. We should get rid of those and make everything consistent as soon as LogAssert is ported.

Consider renaming UnitTest directories to test

This would make the capitalization more consistent. Basically, inc, src, and test would all be lower case. Names of projects would be PascalCase. The name "test" is less specific than "UnitTest", but this is unlikely to cause a problem.

Query plans, BM25f, etc.

Do we really need complicated query support that requires NativeJIT? It's possible that some naïve thing is actually sort of ok. For Bing, we have a lot of really complex training, but if we just support something like what Lucene, it's not clear that we need all this.

TODO: do the experiment of trying a naïve ranking algorithm to see how much this matters.

CmdLineParser should support -- arguments.

Right now CmdLineParser supports -flags and /flags. It would be more consistent with Linux and GNU conventions if it supported --flags.

`make test` does not depend on test projects

On OS X, if you run make test from a clean build, it will fail, claiming it can't find the test binaries.

If you run make && make test they will be found, and then run.

Therefore, we conclude that the test target does not depend on the target binaries.

Add appropriate copyright message to MurmurHash.

This might be a good place to start investigating: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp

Replace most cases of #ifdef BITFUNNEL_PLATFORM_*

We use BITFUNNEL_PLATFORM_WINDOWS and friends in ifdefs, but in a lot of cases what we really mean is MSVC vs. clang vs. gcc. But not in all cases.

But this means that, for example, building with clang on Windows probably isn't going to work?

Undo commit 8eb0dd2f (if (BITFUNNEL_PLATFORM_WINDOWS))

As soon as we have at least one file that compiles in Linux/OSX, we should back out this change.

README.md for Utilities

Document classes as of d3fc186.

CsvTsv::TypedColumn<unsigned int>::Read(): incorrect behavior for negative numbers.

Reading "-1" on Windows results in 0xffffffff. The same call on OSX fails. According to the discussion at http://www.cplusplus.com/forum/general/153669/, some implementations will silently return 0.

Replace IInputStream with something that looks like an istream?

The original rationale for IInputStream, from Andrija:

From what I remember, we found that reading serialized plan sent from MLA to OS from standard input stream was a bottleneck and needed to address that.

I think the bottleneck was related to global locale lock. I forget whether we were using stringstream or our vector stream implementation. If the latter, in addition to the locale lock we likely had a problem there because it allowed reading one character at the time.

IInputStream has a Read() method which can read a chunk of bytes and also doesn’t get affected by locale. The cost is the virtual function call, but that was OK because we were reading large chunks.

But this could be done with something that acts like an istream.

Determine if we need the boulders/pebbles/dust algorithm

Word on the street is that the algorithm totally falls over if we remove the escape hatch when placement fails 20 times in a row. If that's true, that seems to indicate that many (most?) placements are into full locations, which implies that the algorithm isn't doing much to help us.

Additionally, this has been running in production for quite a while and the system still seems to work ok despite never having been re-trained to adjust for the change in frequency of words over time.

Some questions are:

Do we need it at all? If we're really bailing that often, why can't we just do random placement?
How much do we lose from doing random placement over optimal placement?
If we need this, why is 20 the right number? Why not 2 or 200 or something else?

Determine strategy for register allocation on conditionals

The current implementation is a hack. It's possible that's fine because of how it's used, but we should discuss this and make a concious design decision if that's the case.

Required version of clang?

From the example code for std::atomic::compare_exchange_weak

// Note: the above use is not thread-safe in at least 
// GCC prior to 4.8.3 (bug 60272), clang prior to 2014-05-05 (bug 18899)
// MSVC prior to 2014-03-17 (bug 819819). The following is a workaround:

We already require GCC5 and VC2015, but we might have to bump our required version of clang if we use this inside our threading primitives.

This may not matter; filing this anyway in case I swap this in so I don't forget about this.

CMake target TOPLEVEL doesn't show some files.

Right now certain files are commented out of the TOPLEVEL target because their inclusion either has no impact (i.e. they don't show up) or it causes the TOPLEVEL project to show no files.

CmdLineParser unique_ptr

There's some code in CmdLineParser that, for old VC++ compatibility reasons, used auto_ptr. It's mostly been converted to use unique_ptr, but there's still an array of "raw" pointers that could be converted to an array of unique_ptr.

Change TokenTracker.h m_remainingTokenCount from unsigned int to int or int64_t?

Changing to a signed type would make checking for underflow more intuitive.

Changing to a 64-bit type would preclude the possibility of running into trouble because we have > 2B tokens outstanding. I don't think we know of any actual instances of that, but considering how few instances of the variable we have, I don't think we're really saving much memory by using a potentially 32 bit value.

Make ChunkReader allocate-free

We have a quickly written version in the index branch and should convert it to something with better performance after the first part of the pipeline works.

Adding CMake compiler options/flags

It seems like we should use something like http://stackoverflow.com/questions/23995019/what-is-the-modern-method-for-setting-general-compile-flags-in-cmake, rather then slamming stuff into the COMMON_CXX_FLAGS / CMAKE_CXX_FLAGS string.

I may make this change as part of BitFunnel/NativeJIT#10, but if not we should keep this in mind when making changes.

Benchmark against... something?

Want to find some good workloads to benchmark for query rate, document ingestion, etc.

Merge LoggerInterfaces with BitFunnel

Move inc/LoggerInterfaces to BitFunnel/LoggerInterfaces.
Change namespace from Logging to BitFunnel.

Set up Travis CI

Separate CmdLineParser unit tests

Right now CmdLineParser does 73 tests inside of a single gtest TEST(). Consider breaking these out into separate tests. Need to change the baseline generation code to generate gtest TEST() syntax.

`IFieldFormatter::WriteField` causes compilation error on Clang/OS X when we construct an `OutputColumn<size_t>`.

See note in DocumentLengthHistogram.cpp, in which we currently use OutputColumn<uint64_t> to get around this compiler error.

The essence of the error is that the call to WriteField becomes ambiguous when you use size_t.

Modulus operator in SimpleHashSetBase::TryFindSlot

Given that we have our own hash table implementation, we should consider limiting the capacity to powers of 2 and using a mask instead of a modulus. The last time I wrote my own hash table implementation I got a substantial improvement on hash-heavy workloads by doing that.

This is minor compared to getting things up and running again, but it's a trivial change that can have an outsized performance impact so it's worth looking at once we have a system that runs.

Determine safety of save/restore from files

See http://danluu.com/file-consistency/ and the tool from https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Replace runtime_error with BitFunnel::FatalError or BitFunnel::RecoverableError

It sounds like there's a change coming in that lets us replace are uses of runtime_error. This is basically a TODO to myself to replace this in code that just went through code review and got checked in, and also check for other instances of the same thing.

Figure out if having exact OR queries is feasible

CsvTsv should build and pass tests on linux and osx.

CmdLineParser should build and pass tests on linux and osx.

Determine which VC++ warning removal pragmas we don't need

CMakeLists.txt for tests

We currently declare tests in two different locations, one of which is the top-level file. It's a bit weird to have to add tests from the top level file. It's possible that moving enable_testing() before add_subdirectory will remove the need for the duplicate top-level test stuff. If not, we should figure out what the problem is.