elfie-arriba's Introduction

Arriba

Arriba is a C# data engine designed for instant structured search, free text search, and data exploration across large single machine datasets.

Give Arriba CSVs or simple C# arrays and it will automatically create tables with appropriately typed columns.

Arriba features a simple, elegant query syntax so that users can write anything from a web-style query "louvau -Closed" to a fully structured query ([Assigned To] = "Scott Louvau" AND ([State] != "Closed" OR [Remaining Work] > 0)).

Arriba has a beautiful website to make search and exploration easy, with comprehensive query suggestions, a configurable listing, customizable item details, and a Grid for quick analytics. Query suggestions go beyond showing just column names and the search syntax by adding "Inline Insights", showing query-specific top values and distributions for columns and showing which columns word searches are matching to answer questions directly and help users construct the queries they really intend.

Arriba exposes a comprehensive HTTP service you can use to programmatically run queries and aggregations, get query suggestions, and add/decorate/update/delete rows.

You can even host the Arriba engine directly in your C# process, creating tables in-memory and making custom column types, queries, and aggregations by implementing simple interfaces.

See the Arriba QuickStart to get Arriba and the Website running with sample CSV data in 15 minutes.

Elfie

Elfie is a library which makes it easy to build memory-efficient, extremely fast item sets providing search and traversal. Elfie uses the "structure of arrays" layout model for performance. A set class contains multiple columns of primitive types, enums, or a replacement string type, String8. Items of the set are structs which point to the set and a specific index. This gives the performance of structs (no allocations) with the convenience of classes (updates change all references of the item without copying). Elfie has primitives to provide text search (MemberIndex), define hierarchies (ItemTree), and define graphs (ItemMap). It also provides very fast read and write of CSV, TSV, and JSON via a consistent interface.
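The "structure of arrays" idea described above can be sketched in a few lines. This is a language-neutral illustration in Python, not Elfie's API: `PersonSet`, `Person`, and their members are hypothetical names. Python also allocates an object for each handle, so the sketch shows only the layout and the shared-update behavior, not the allocation savings that C# structs provide.

```python
from dataclasses import dataclass
from typing import List


class PersonSet:
    """Hypothetical set: one list per column instead of one object per item."""

    def __init__(self) -> None:
        self.names: List[str] = []
        self.ages: List[int] = []

    def add(self, name: str, age: int) -> "Person":
        self.names.append(name)
        self.ages.append(age)
        return Person(self, len(self.names) - 1)

    def __getitem__(self, index: int) -> "Person":
        return Person(self, index)


@dataclass(frozen=True)
class Person:
    """Lightweight handle: stores only the owning set and a row index."""

    owner: PersonSet
    index: int

    @property
    def name(self) -> str:
        return self.owner.names[self.index]

    @property
    def age(self) -> int:
        return self.owner.ages[self.index]

    def set_age(self, value: int) -> None:
        # The write goes to the column, so every handle to this row sees it.
        self.owner.ages[self.index] = value


people = PersonSet()
first = people.add("Ada", 36)
alias = people[0]      # a second handle to the same row
first.set_age(37)      # alias.age now reads the updated column value
```

Because a handle carries no data of its own, copying it is cheap and two handles to the same row can never disagree.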

As an example, a set of ~5M Active Directory items with five columns and ~25M links in a graph fits in ~800MB, loads in ~600ms, and can be traversed at a rate of ~15M links per second, measured on a Surface Book i7.

Contributing

Arriba and Elfie are not owned by a dedicated team, so while fixes and small changes are welcome, our ability to include contributions and comment on design changes is limited. For larger fixes and design change ideas, please contact us so that we can comment on the design or suggest working in another fork.

Please:

  • Compare the performance of real-life scenarios involving your code to avoid regressions. Arriba and Elfie performance depends on minimizing allocations, boxing, and indirect method calls.

Arriba and Elfie were created by Microsoft to enable great internal tools, and we've opened them hoping they will enable you to create great search and analytics tools in your favorite language. =)

elfie-arriba's People

Contributors

dannychenmsft, genlu, jeffersonking, microsoft-github-policy-service[bot], msftgits, rtaket, sbeland, scottlouvau, sharwell

elfie-arriba's Issues

Improve logging capabilities

The current code base has a mix of logging styles that still leaves much to be desired. Improve the logging to use and enforce structured logging, and enable a cloud-backed solution such as Application Insights.

Add IPRange column type

  • Requested by JBL for storing single IPs and IP ranges.
  • Has a start and end address, inclusive, translated to integers from dotted notation.
  • Searches for a single IP should translate to an integer and find values in range.
  • Searches for an IP prefix (10.120) should translate to a range (10.120.0.0 - 10.120.255.255) and check for any intersection from the column ranges.

We could also use a column capable of containing IPv6 addresses, though they might be separate column types.
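The IPv4 translation and range checks listed above could look roughly like the following. This is a sketch under assumptions, not the planned column type: the function names are hypothetical, prefixes are assumed to end on octet boundaries (as in the `10.120` example), and IPv6 is not handled.

```python
from typing import Tuple

IpRange = Tuple[int, int]  # inclusive (start, end) as 32-bit integers


def ip_to_int(dotted: str) -> int:
    """Translate a full dotted IPv4 address to an integer."""
    parts = [int(p) for p in dotted.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a full IPv4 address: {dotted!r}")
    value = 0
    for part in parts:
        value = (value << 8) | part
    return value


def prefix_to_range(prefix: str) -> IpRange:
    """Expand a one- to four-octet prefix to the inclusive range it covers."""
    octets = prefix.rstrip(".").split(".")
    if not 1 <= len(octets) <= 4:
        raise ValueError(f"not an IPv4 prefix: {prefix!r}")
    missing = 4 - len(octets)
    start = ip_to_int(".".join(octets + ["0"] * missing))
    end = ip_to_int(".".join(octets + ["255"] * missing))
    return start, end


def intersects(a: IpRange, b: IpRange) -> bool:
    """True when two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


# A stored column value covering 10.120.3.0 through 10.120.3.255.
stored = (ip_to_int("10.120.3.0"), ip_to_int("10.120.3.255"))
matches_prefix = intersects(stored, prefix_to_range("10.120"))
matches_single = intersects(stored, prefix_to_range("10.120.3.77"))
```

A full four-octet search term expands to a range whose start and end are equal, so single-IP and prefix searches can share the same intersection check.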

XForm: Null and Empty handling issues

Several Issues:

  • It looks like 'where [Column] = ""' isn't working correctly.
  • Add a function to allow checking for null or empty string.
  • Add an easy way to turn "NULL" the string into a null.
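The two requested helpers have simple semantics, sketched here in Python for illustration only. The names `is_null_or_empty` and `null_if` are hypothetical and are not existing XForm functions.

```python
from typing import Optional


def is_null_or_empty(value: Optional[str]) -> bool:
    """True for a missing value or a zero-length string."""
    return value is None or value == ""


def null_if(value: Optional[str], marker: str = "NULL") -> Optional[str]:
    """Turn the literal marker string into a real null; pass anything else through."""
    return None if value == marker else value


cleaned = [null_if(v) for v in ["Active", "NULL", ""]]
```

Keeping the two checks separate lets a query distinguish "no value" from "empty value" when it needs to, while `is_null_or_empty` covers the common case of treating them alike.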

Move to using the common aspnet controller pattern

The present implementation uses a custom request-handling decorator pattern. It works, but it is unnecessarily complex and increases the time someone new to the code base needs to ramp up.

Race condition in AddOrUpdate when adding and updating an item in the same block

Because ChooseSplit is done as a parallel operation, an item that is added and then updated in the same call to AddOrUpdate might be evaluated in an arbitrary order. To fix this, if we detect an update to an add, we should first sort the incoming data and then evaluate it in the order presented (the same as if each add or update were made individually).
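One way to read the proposed fix is to tag each incoming row with its position in the batch and sort on that tag before applying rows for a partition. The Python sketch below illustrates that reading only; `apply_batch`, the row shape, and the seeded shuffle that stands in for the parallel split are all assumptions, not Arriba code.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, str]  # (key, value); hypothetical row shape


def apply_batch(table: Dict[int, str], batch: Sequence[Row],
                partitions: int = 4, seed: int = 0) -> None:
    # Tag every row with its position so the caller's order can be restored.
    tagged = [(position, key, value)
              for position, (key, value) in enumerate(batch)]

    # Stand-in for a parallel split that hands rows over in arbitrary order.
    random.Random(seed).shuffle(tagged)

    # Rows with the same key always land in the same bucket.
    buckets: List[List[Tuple[int, int, str]]] = [[] for _ in range(partitions)]
    for row in tagged:
        buckets[row[1] % partitions].append(row)

    # Sorting by position makes an add followed by an update of the same key
    # apply in the order the caller presented them.
    for bucket in buckets:
        for _, key, value in sorted(bucket):
            table[key] = value


table: Dict[int, str] = {}
apply_batch(table, [(1, "added"), (2, "other"), (1, "updated")])
```

Only the order within a bucket matters here, since two rows for the same key can never be split across buckets.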

Provide alternate hashing/distribution schemes

Allow support for custom hash application.

For ordered integer keys, a round robin (modulus) is faster and more balanced than the Murmur3 hash. For time-ordered data, fill-then-spill (fill a partition before moving on) is best.
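The two alternative schemes are easy to state as functions from a row to a partition number. The Python sketch below is illustrative: the function names, the `partition_capacity` parameter, and the `distribute` helper are hypothetical and do not describe Arriba's partitioning code.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def round_robin(key: int, partition_count: int) -> int:
    """Modulus distribution: consecutive integer keys cycle through partitions."""
    return key % partition_count


def fill_then_spill(row_index: int, partition_capacity: int) -> int:
    """Fill one partition to capacity before moving on to the next."""
    return row_index // partition_capacity


def distribute(keys: Iterable[int], scheme: Callable[[int], int]) -> Dict[int, int]:
    """Count how many keys a pluggable scheme sends to each partition."""
    return dict(Counter(scheme(key) for key in keys))


balanced = distribute(range(1000), lambda k: round_robin(k, 4))
time_ordered = distribute(range(1000), lambda i: fill_then_spill(i, 400))
```

Passing the scheme as a callable is one way to leave room for the custom hash application the issue asks for. Note that fill-then-spill takes the arrival position of a row rather than its key, which is why it suits data that arrives in time order.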

Enable running in Azure by using persistent cloud storage

Investigate what it would take to remove the hard dependency on the file system and move to a cloud storage provider. The first pass would be a naïve implementation; a future version may enable versioning of the data and global replication.

Code sample for quick substring search in a large list of strings

Hello folks!

I have a million strings (file paths) in an array. When user enters a search pattern I need to quickly display a list of all file paths from my array that have that search pattern as a substring. I only need the top N results.

What's the minimal code sample using Elfie to achieve my scenario? It's OK to take time to prepare any data structures in advance, but after that searches should be fast.

Thanks!

Instant update after UI actions

Hello folks,

Thanks a lot for this awesome project.

I'm noticing that most UI actions are only applied after changing the current view to another one. Examples:

  • Click on a listing result header to sort results
  • Show / hide recent queries list
  • Delete an item from the recent queries list
  • Mark search query as a favourite one

This behavior is consistent, which left me wondering: is it on purpose, or am I missing something?

Select OrderBy is not stable if a non-unique column is used for ordering

If a non-unique column is used for ordering, compute returns the top N matches, but those can be out of order with respect to a secondary sort ordering (the ID column is implied for now). Merge assumes these are in order. This was made worse by parallel/cascading merges, because merges now happen in arbitrary order instead of partition order.
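A common remedy for this kind of instability is to make the tiebreaker part of the sort key in both the per-partition top-N step and the merge step. The Python sketch below shows that idea with hypothetical names and a hypothetical `(id, state)` row shape; it is not Arriba's compute or merge code.

```python
import heapq
from itertools import islice
from typing import List, Sequence, Tuple

Row = Tuple[int, str]  # (id, state); hypothetical row shape


def sort_key(row: Row) -> Tuple[str, int]:
    """Order by the non-unique column first, then by ID to break ties."""
    row_id, state = row
    return (state, row_id)


def top_n(partition: Sequence[Row], n: int) -> List[Row]:
    """Per-partition result, already sorted by the full key."""
    return heapq.nsmallest(n, partition, key=sort_key)


def merge_top_n(partials: Sequence[List[Row]], n: int) -> List[Row]:
    """Merge sorted partial results; the outcome does not depend on merge order."""
    return list(islice(heapq.merge(*partials, key=sort_key), n))


p1 = top_n([(5, "Active"), (1, "Closed"), (3, "Active")], 3)
p2 = top_n([(2, "Active"), (4, "Closed")], 3)
forward = merge_top_n([p1, p2], 3)
backward = merge_top_n([p2, p1], 3)
```

Because no two rows share the same `(state, id)` key, the merged result is the same whichever order the partial results arrive in.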
