
eland's Introduction



About

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.

Where possible the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, or scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.

Eland also provides tools to upload trained machine learning models from common libraries like scikit-learn, XGBoost, and LightGBM into Elasticsearch.

Getting Started

Eland can be installed from PyPI with Pip:

$ python -m pip install eland

If using Eland to upload NLP models to Elasticsearch, install the PyTorch extras:

$ python -m pip install 'eland[pytorch]'

Eland can also be installed from Conda Forge with Conda:

$ conda install -c conda-forge eland

Compatibility

  • Supports Python 3.8, 3.9, 3.10, 3.11 and Pandas 1.5
  • Supports Elasticsearch clusters running 7.11+; 8.13 or later is recommended for all features to work. If you are using the NLP with PyTorch feature, make sure your Eland minor version matches the minor version of your Elasticsearch cluster. For all other features it is sufficient for the major versions to match.
  • You need to install the appropriate version of PyTorch to import an NLP model. Run python -m pip install 'eland[pytorch]' to install that version.

Prerequisites

Users installing Eland on Debian-based distributions may need to install prerequisite packages for the transitive dependencies of Eland:

$ sudo apt-get install -y \
  build-essential pkg-config cmake \
  python3-dev libzip-dev libjpeg-dev

Note that other distributions such as CentOS, RedHat, Arch, etc. may require using a different package manager and specifying different package names.

Docker

If you want to run the available scripts without installing Eland, use the Docker image. It can be used interactively:

$ docker run -it --rm --network host docker.elastic.co/eland/eland

Running installed scripts is also possible without an interactive shell, e.g.:

$ docker run -it --rm --network host \
    docker.elastic.co/eland/eland \
    eland_import_hub_model \
      --url http://host.docker.internal:9200/ \
      --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
      --task-type ner

Connecting to Elasticsearch

Eland uses the Elasticsearch low-level client to connect to Elasticsearch. This client supports a range of connection and authentication options.

You can pass either an instance of elasticsearch.Elasticsearch to Eland APIs or a string containing the host to connect to:

import eland as ed

# Connecting to an Elasticsearch instance running on 'http://localhost:9200'
df = ed.DataFrame("http://localhost:9200", es_index_pattern="flights")

# Connecting to an Elastic Cloud instance
from elasticsearch import Elasticsearch

es = Elasticsearch(
    cloud_id="cluster-name:...",
    basic_auth=("elastic", "<password>")
)
df = ed.DataFrame(es, es_index_pattern="flights")

DataFrames in Eland

eland.DataFrame wraps an Elasticsearch index in a Pandas-like API and defers all processing and filtering of data to Elasticsearch instead of your local machine. This means you can process large amounts of data within Elasticsearch from a Jupyter Notebook without overloading your machine.

Eland DataFrame API documentation

Advanced examples in a Jupyter Notebook

>>> import eland as ed

>>> # Connect to 'flights' index via localhost Elasticsearch node
>>> df = ed.DataFrame('http://localhost:9200', 'flights')

# eland.DataFrame instance has the same API as pandas.DataFrame
# except all data is in Elasticsearch. See .info() memory usage.
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 27 columns]

>>> df.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   AvgTicketPrice      13059 non-null  float64       
 1   Cancelled           13059 non-null  bool          
 2   Carrier             13059 non-null  object        
...      
 24  OriginWeather       13059 non-null  object        
 25  dayOfWeek           13059 non-null  int64         
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
Elasticsearch storage usage: 5.043 MB

# Filtering of rows using comparisons
>>> df[(df.Carrier=="Kibana Airlines") & (df.AvgTicketPrice > 900.0) & (df.Cancelled == True)].head()
     AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
8        960.869736       True  ...         0 2018-01-01 12:09:35
26       975.812632       True  ...         0 2018-01-01 15:38:32
311      946.358410       True  ...         0 2018-01-01 11:51:12
651      975.383864       True  ...         2 2018-01-03 21:13:17
950      907.836523       True  ...         2 2018-01-03 05:14:51

[5 rows x 27 columns]

# Running aggregations across an index
>>> df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
     DistanceKilometers  AvgTicketPrice
sum        9.261629e+07    8.204365e+06
min        0.000000e+00    1.000205e+02
std        4.578263e+03    2.663867e+02

Machine Learning in Eland

Regression and classification

Eland allows serializing trained regression and classification models from the scikit-learn, XGBoost, and LightGBM libraries so they can be used as inference models in Elasticsearch.

Eland Machine Learning API documentation

Read more about Machine Learning in Elasticsearch

>>> from sklearn import datasets
>>> from xgboost import XGBClassifier
>>> from eland.ml import MLModel

# Train and exercise an XGBoost ML model locally
>>> training_data = datasets.make_classification(n_features=5)
>>> xgb_model = XGBClassifier(booster="gbtree")
>>> xgb_model.fit(training_data[0], training_data[1])

>>> xgb_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]

# Import the model into Elasticsearch
>>> es_model = MLModel.import_model(
    es_client="http://localhost:9200",
    model_id="xgb-classifier",
    model=xgb_model,
    feature_names=["f0", "f1", "f2", "f3", "f4"],
)

# Exercise the ML model in Elasticsearch with the training data
>>> es_model.predict(training_data[0])
[0 1 1 0 1 0 0 0 1 0]

NLP with PyTorch

For NLP tasks, Eland allows importing PyTorch-trained BERT models into Elasticsearch. Models can be either plain PyTorch models or supported transformers models from the Hugging Face model hub.

$ eland_import_hub_model \
  --url http://localhost:9200/ \
  --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
  --task-type ner \
  --start

The example above will automatically start a model deployment. This is a good shortcut for initial experimentation, but for anything that needs good throughput you should omit the --start argument from the Eland command line and instead start the model using the ML UI in Kibana. The --start argument will deploy the model with one allocation and one thread per allocation, which will not offer good performance. When starting the model deployment using the ML UI in Kibana or the Elasticsearch API you will be able to set the threading options to make the best use of your hardware.
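When starting the deployment yourself, a minimal sketch using the Elasticsearch Python client looks like this (assuming elasticsearch-py 8.x and the model ID that Eland derives from the hub ID; the allocation and thread counts are placeholders to tune for your hardware):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Start the imported model with explicit threading options instead of --start
es.ml.start_trained_model_deployment(
    model_id="elastic__distilbert-base-cased-finetuned-conll03-english",  # assumed derived ID
    number_of_allocations=2,       # placeholder: model allocations
    threads_per_allocation=4,      # placeholder: inference threads per allocation
    wait_for="started",
)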

>>> import elasticsearch
>>> from pathlib import Path
>>> from eland.common import es_version
>>> from eland.ml.pytorch import PyTorchModel
>>> from eland.ml.pytorch.transformers import TransformerModel

>>> es = elasticsearch.Elasticsearch("http://elastic:mlqa_admin@localhost:9200")
>>> es_cluster_version = es_version(es)

# Load a Hugging Face transformers model directly from the model hub
>>> tm = TransformerModel(model_id="elastic/distilbert-base-cased-finetuned-conll03-english", task_type="ner", es_version=es_cluster_version)
Downloading: 100%|██████████| 257/257 [00:00<00:00, 108kB/s]
Downloading: 100%|██████████| 954/954 [00:00<00:00, 372kB/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 668kB/s] 
Downloading: 100%|██████████| 112/112 [00:00<00:00, 43.9kB/s]
Downloading: 100%|██████████| 249M/249M [00:23<00:00, 11.2MB/s]

# Export the model in a TorchScript representation which Elasticsearch uses
>>> tmp_path = "models"
>>> Path(tmp_path).mkdir(parents=True, exist_ok=True)
>>> model_path, config, vocab_path = tm.save(tmp_path)

# Import model into Elasticsearch
>>> ptm = PyTorchModel(es, tm.elasticsearch_model_id())
>>> ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)
100%|██████████| 63/63 [00:12<00:00,  5.02it/s]

Feedback 🗣️

The engineering team here at Elastic is looking for developers to participate in research and feedback sessions to learn more about how you use Eland and what improvements we can make to its design and your workflow. If you're interested in sharing your insights into developer experience and language client design, please fill out this short form. Depending on the number of responses we get, we may either contact you for a 1:1 conversation or a focus group with other developers who use the same client. Thank you in advance - your feedback is crucial to improving the user experience for all Elasticsearch developers!

eland's People

Contributors

afoucret, ashton-sidhu, bartbroere, benwtrent, blaklaybul, daixque, davidkyle, demjened, dolaru, droberts195, edsavage, ezimuel, fju, iuliaferoli, jonathan-buttner, joshdevins, jrodewig, kxbin, lcawl, leemthompo, leonardbinet, mesejo, picandocodigo, pquentin, sakurai-youhei, sethmlarson, stevedodson, szabosteve, v1nay8, valeriy42

eland's Issues

Problem with reading

I am able to establish a connection but there is some problem with reading the columns. When I do df.columns, I see an empty list. When I do df.shape, I see (260654884, 0). When I do df.info_es(), I see the following error.

AttributeError Traceback (most recent call last)
in
----> 1 df.info_es()

~/.local/lib/python3.7/site-packages/eland/dataframe.py in info_es(self)
548 buf = StringIO()
549
--> 550 super()._info_es(buf)
551
552 return buf.getvalue()

~/.local/lib/python3.7/site-packages/eland/ndframe.py in _info_es(self, buf)
149
150 def _info_es(self, buf):
--> 151 self._query_compiler.info_es(buf)
152
153 def mean(self, numeric_only=True):

~/.local/lib/python3.7/site-packages/eland/query_compiler.py in info_es(self, buf)
471 self._index.info_es(buf)
472 self._mappings.info_es(buf)
--> 473 self._operations.info_es(self, buf)
474
475 def describe(self):

~/.local/lib/python3.7/site-packages/eland/operations.py in info_es(self, query_compiler, buf)
774 query_params, post_processing = self._resolve_tasks(query_compiler)
775 size, sort_params = Operations._query_params_to_size_and_sort(query_params)
--> 776 _source = query_compiler._mappings.get_field_names()
777
778 script_fields = query_params['query_script_fields']

~/.local/lib/python3.7/site-packages/eland/field_mappings.py in get_field_names(self, include_scripted_fields)
548 def get_field_names(self, include_scripted_fields=True):
549 if include_scripted_fields:
--> 550 return self._mappings_capabilities.es_field_name.to_list()
551
552 return self._mappings_capabilities[

~/.local/lib/python3.7/site-packages/pandas/core/generic.py in getattr(self, name)
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5178 return self[name]
-> 5179 return object.getattribute(self, name)
5180
5181 def setattr(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'es_field_name'

Pandas 0.25.2 causes regressions in eland testsuite

In order to support Python 3.8, we will need to upgrade eland's Pandas dependency to 0.25.2 from the current 0.25.1.
Keeping the dependency as 0.25.1 and using eland with Python 3.8 causes the following test failure

======================================================================================= FAILURES ========================================================================================
_________________________________________________________________________ TestDataFrameQuery.test_simple_query __________________________________________________________________________
self = <eland.tests.dataframe.test_query_pytest.TestDataFrameQuery object at 0x7febc580af70>
    def test_simple_query(self):
        ed_flights = self.ed_flights()
        pd_flights = self.pd_flights()
>       assert pd_flights.query('FlightDelayMin > 60').shape == \
               ed_flights.query('FlightDelayMin > 60').shape
eland/tests/dataframe/test_query_pytest.py:50: 


self = <pandas.core.computation.expr.PandasExprVisitor object at 0x7febc582c130>, node = <_ast.Constant object at 0x7febc582cd90>, kwargs = {'side': 'right'}, method = 'visit_Constant'
visitor = <bound method NodeVisitor.visit_Constant of <pandas.core.computation.expr.PandasExprVisitor object at 0x7febc582c130>>
    def visit(self, node, **kwargs):
        if isinstance(node, str):
            clean = self.preparser(node)
            try:
                node = ast.fix_missing_locations(ast.parse(clean))
            except SyntaxError as e:
                from keyword import iskeyword
                if any(iskeyword(x) for x in clean.split()):
                    e.msg = "Python keyword not valid identifier" " in numexpr query"
                raise e
        method = "visit_" + node.__class__.__name__
        visitor = getattr(self, method)
>       return visitor(node, **kwargs)
E       TypeError: visit_Constant() got an unexpected keyword argument 'side'
/usr/local/lib/python3.8/site-packages/pandas/core/computation/expr.py:441: TypeError

@blaklaybul found these issues which correspond to a pandas bug, which is fixed in pandas 0.25.2
pandas-dev/pandas#27261
pandas-dev/pandas#28101

After upgrading the pandas dependency to 0.25.2, the eland testsuite started failing in the following tests

=================================== FAILURES ===================================
__________________ TestDataFrameRepr.test_num_rows_repr_html ___________________
self = <eland.tests.dataframe.test_repr_pytest.TestDataFrameRepr object at 0x7f9da80ff3a0>
    def test_num_rows_repr_html(self):
        # check setup works
        assert pd.get_option('display.max_rows') == 60
        show_dimensions = pd.get_option('display.show_dimensions')
        # TODO - there is a bug in 'show_dimensions' as it gets added after the last </div>
        # For now test without this
        pd.set_option('display.show_dimensions', False)
        # Test eland.DataFrame.to_string vs pandas.DataFrame.to_string
        # In pandas calling 'to_string' without max_rows set, will dump ALL rows
        # Test n-1, n, n+1 for edge cases
>       self.num_rows_repr_html(pd.get_option('display.max_rows')-1)
eland/tests/dataframe/test_repr_pytest.py:166: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <eland.tests.dataframe.test_repr_pytest.TestDataFrameRepr object at 0x7f9da80ff3a0>
rows = 59, max_rows = None
    def num_rows_repr_html(self, rows, max_rows=None):
        ed_flights = self.ed_flights()
        pd_flights = self.pd_flights()
        ed_head = ed_flights.head(rows)
        pd_head = pd_flights.head(rows)
        ed_head_str = ed_head._repr_html_()
        pd_head_str = pd_head._repr_html_()
        #print(ed_head_str)
        #print(pd_head_str)
>       assert pd_head_str == ed_head_str
E       AssertionError: assert '<div>\n<styl...able>\n</div>' == '<div>\n<styl...able>\n</div>'
E         Skipping 496 identical leading characters in diff, use -v to show
E         Skipping 146 identical trailing characters in diff, use -v to show
E           >
E         -       <th>0</th>
E         ?         ^     ^
E         +       <td>0</td>
E         ?         ^     ^...
E         
E         ...Full output truncated (640 lines hidden), use '-vv' to show
eland/tests/dataframe/test_repr_pytest.py:186: AssertionError
___________________________ TestSeriesRepr.test_repr ___________________________
self = <eland.tests.series.test_repr_pytest.TestSeriesRepr object at 0x7f9da80fe880>
    def test_repr(self):
        pd_s = self.pd_flights()['Carrier']
        ed_s = ed.Series(ELASTICSEARCH_HOST, FLIGHTS_INDEX_NAME, 'Carrier')
        pd_repr = repr(pd_s)
        ed_repr = repr(ed_s)
>       assert pd_repr == ed_repr
E       AssertionError: assert '0         Ki...dtype: object' == '0         Ki...dtype: object'
E         Skipping 291 identical leading characters in diff, use -v to show
E         -  Carrier, dtype: object
E         +  Carrier, Length: 13059, dtype: object
E         ?           +++++++++++++++
eland/tests/series/test_repr_pytest.py:17: AssertionError
======================== 2 failed, 74 passed in 22.68s =========================

Change version numbering to support pip version patterns

The current versioning scheme does not support pip versioning patterns. Some examples from a requirements.txt:

eland==7.*
eland>=7.0

Results in:

Could not find a version that satisfies the requirement eland==7.* (from -r requirements.txt (line 19)) (from versions: 7.5.1a2, 7.5.1a3, 7.5.1a4, 7.6.0a1, 7.6.0a2, 7.6.0a3)
No matching distribution found for eland==7.* (from -r requirements.txt (line 19))

Versions like 7.6.0a3 are not semantically versioned, so the only way to pip install the latest is to go straight to GitHub with git+ssh://git@github.com/elastic/eland.git; however, this gets you unreleased and possibly unstable versions as well.

Add Support for Chained Arithmetics

We currently only support arithmetic with two operands, i.e. the following throws an exception:

df['total_quantity'] * df['taxful_total_price'] * 0.65
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michael/projects/eland/eland/series.py", line 678, in __mul__
    return self._numeric_op(right, _get_method_name())
  File "/Users/michael/projects/eland/eland/series.py", line 1039, in _numeric_op
    elif np.issubdtype(np.dtype(type(right)), np.number) and np.issubdtype(self._dtype, np.number):
  File "/Users/michael/projects/eland/eland/series.py", line 368, in _dtype
    return self._query_compiler.dtypes[0]
  File "/Users/michael/projects/eland/eland/query_compiler.py", line 104, in dtypes
    return self._mappings.dtypes(columns)
  File "/Users/michael/projects/eland/eland/mappings.py", line 501, in dtypes
    {key: np.dtype(self._source_field_pd_dtypes[key]) for key in field_names})
  File "/Users/michael/projects/eland/eland/mappings.py", line 501, in <dictcomp>
    {key: np.dtype(self._source_field_pd_dtypes[key]) for key in field_names})
KeyError: 'total_quantity___mul___taxful_total_price'
>>> df['customer_last_name'] + "," + df['customer_first_name']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michael/projects/eland/eland/series.py", line 521, in __add__
    return self._numeric_op(right, _get_method_name())
  File "/Users/michael/projects/eland/eland/series.py", line 1014, in _numeric_op
    if (np.issubdtype(self._dtype, np.number) and np.issubdtype(right._dtype, np.number)):
  File "/Users/michael/projects/eland/eland/series.py", line 368, in _dtype
    return self._query_compiler.dtypes[0]
  File "/Users/michael/projects/eland/eland/query_compiler.py", line 104, in dtypes
    return self._mappings.dtypes(columns)
  File "/Users/michael/projects/eland/eland/mappings.py", line 501, in dtypes
    {key: np.dtype(self._source_field_pd_dtypes[key]) for key in field_names})
  File "/Users/michael/projects/eland/eland/mappings.py", line 501, in <dictcomp>
    {key: np.dtype(self._source_field_pd_dtypes[key]) for key in field_names})
KeyError: 'customer_last_name___add___,||str'

I think this is because of the way that we create intermediary field names to store the scripted fields. We'll need logic to understand that the latter operations rely on the first and make the appropriate field names available to subsequent ops.
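Until chained operations are supported, one hedged workaround is to materialize the columns involved into pandas and chain the arithmetic locally (a sketch assuming a local cluster, a recent Eland release, and the ecommerce sample index):

import eland as ed

df = ed.DataFrame("http://localhost:9200", es_index_pattern="ecommerce")

# Pull only the columns involved into pandas, then chain the operations there
pdf = ed.eland_to_pandas(df[["total_quantity", "taxful_total_price"]])
result = pdf["total_quantity"] * pdf["taxful_total_price"] * 0.65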

UnboundLocalError Returned for dataframe filters that have no results

If the query returns no results, the following exception is thrown. This should instead return an empty DataFrame.

eland

>>> df2 = ed.read_es('localhost:9200', 'ecommerce')
>>> df2[df2['currency']=='USD']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michael/projects/eland/eland/dataframe.py", line 273, in __repr__
    line_width=width, show_dimensions=show_dimensions)
  File "/Users/michael/projects/eland/eland/dataframe.py", line 662, in to_string
    line_width=line_width)
  File "/Users/michael/projects/eland/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 759, in to_string
    line_width=line_width,
  File "/Users/michael/projects/eland/venv/lib/python3.7/site-packages/pandas/io/formats/format.py", line 502, in __init__
    self._chk_truncate()
  File "/Users/michael/projects/eland/venv/lib/python3.7/site-packages/pandas/io/formats/format.py", line 525, in _chk_truncate
    n_add_rows = self.header + dot_row + show_dimension_rows + prompt_row
UnboundLocalError: local variable 'show_dimension_rows' referenced before assignment

pandas

>>> pdf = ed.eland_to_pandas(df2)
>>> pdf[pdf['currency']=='USD']
Empty DataFrame
Columns: [category, currency, customer_birth_date, customer_first_name, customer_full_name, customer_gender, customer_id, customer_last_name, customer_phone, day_of_week, day_of_week_i, email, geoip.city_name, geoip.continent_name, geoip.country_iso_code, geoip.location, geoip.region_name, manufacturer, order_date, order_id, products._id, products.base_price, products.base_unit_price, products.category, products.created_on, products.discount_amount, products.discount_percentage, products.manufacturer, products.min_price, products.price, products.product_id, products.product_name, products.quantity, products.sku, products.tax_amount, products.taxful_price, products.taxless_price, products.unit_discount_amount, sku, taxful_total_price, taxless_total_price, total_quantity, total_unique_products, type, user]
Index: []

Support Elasticsearch queries

As the data is being pulled from Elasticsearch, there is a whole range of search functions available that could be exposed to allow filtering of the data retrieved.
For example, analysed text and fuzzy queries could be used to identify a corpus for analysis in eland.

We could support this through something like mirroring the elasticsearch-py search functions
https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search
https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.msearch

That way the user could choose between the pandas compatible/mirrored functions or the Elasticsearch specific extensions.
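Recent Eland releases expose something along these lines as DataFrame.es_query(), which accepts a raw Elasticsearch query clause alongside the pandas-style API; a minimal sketch, assuming a local cluster and the ecommerce sample index:

import eland as ed

df = ed.DataFrame("http://localhost:9200", es_index_pattern="ecommerce")

# Use an Elasticsearch query to select the corpus, then continue with the
# pandas-compatible API on the filtered frame
usd = df.es_query({"match": {"currency": "USD"}})
print(usd[["currency", "taxful_total_price"]].head())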

Investigate elasticsearch-py results better types conversion to pandas.DataFrame

Converting results from elasticsearch-py to a pandas DataFrame doesn't convert types effectively with this code:

(DataFrame._es_results_to_pandas)

rows = []
for hit in results['hits']['hits']:
    row = hit['_source']
    rows.append(row)
df = pd.DataFrame(data=rows)

e.g. timestamp='2018-01-01T12:09:35' is converted to object

This code converts more effectively, but is less efficient

(DataFrame._es_results_to_pandas)

rows = []
for hit in results['hits']['hits']:
    row = hit['_source']
    rows.append(row)
json_rows = json.dumps(rows)
df = pd.read_json(json_rows)
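A hedged sketch of a third option: build the frame from the _source documents and then convert columns explicitly using the types taken from the index mapping (the hits and es_dtypes values below are illustrative):

import pandas as pd

# Illustrative hits and a {field: es_type} dict derived from the index mapping
hits = [{"_source": {"timestamp": "2018-01-01T12:09:35", "AvgTicketPrice": 841.27}}]
es_dtypes = {"timestamp": "date", "AvgTicketPrice": "float"}

df = pd.DataFrame(data=[hit["_source"] for hit in hits])
for field, es_type in es_dtypes.items():
    if es_type == "date":
        # pd.DataFrame leaves dates as object; convert them to datetime64[ns]
        df[field] = pd.to_datetime(df[field])

print(df.dtypes)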

Call to eland.__version__ should show version number

If I'm working in the Python shell with pandas and want to know which version I have installed, I can do a handy call to

import pandas as pd
pd.__version__

It would be nice if eland supported similar functionality.
At the moment, the behaviour is as follows.

Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.25.3'
>>> import eland as ed
>>> ed.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'eland' has no attribute '__version__'
>>> 
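A minimal sketch of one way to provide this (not necessarily how Eland implements it): read the installed distribution's version in eland/__init__.py.

# eland/__init__.py (sketch)
from importlib.metadata import version  # Python 3.8+

__version__ = version("eland")  # the version of the installed eland package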

Better Handling of Non-Aggregatable Fields in eland/mappings

In aggregatable_field_names, appending .keyword to a non-aggregatable field causes a KeyError to be thrown from pandas in field_capabilities, because .loc is attempting to find a field (field.keyword) that doesn't exist.

We'll need to modify the test indices to include a text field that is not aggregatable.

See #68 (comment) for more details.
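A hedged sketch of the kind of check involved (a hypothetical helper, not Eland's implementation): ask the field_caps API whether the field or its .keyword sub-field is actually aggregatable instead of assuming the suffix exists.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def aggregatable_field_name(index, field):
    """Return an aggregatable variant of `field`, or None if there is none."""
    candidates = [field, f"{field}.keyword"]
    caps = es.field_caps(index=index, fields=candidates)["fields"]
    for name in candidates:
        for type_caps in caps.get(name, {}).values():
            if type_caps.get("aggregatable"):
                return name
    return None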

Create Elastic Stack integration test for model transform classes

Integration test scenario for:

  • Creating models via some known data set.
    • These models should be the types that currently have serialization helpers
  • Gather the predicted values and probabilities (if applicable)
  • PUT the model in Elasticsearch
  • Verify that ES returns the same predicted values and probabilities (within some epsilon); a sketch follows this list.
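A minimal sketch of such a test, assuming a local test cluster and the MLModel helper shown in the README above (the model ID and tolerance are placeholders):

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from eland.ml import MLModel

# Train a model locally on a known data set
X, y = datasets.make_classification(n_features=5, random_state=0)
local_model = DecisionTreeClassifier(random_state=0).fit(X, y)

# PUT the model into Elasticsearch
es_model = MLModel.import_model(
    es_client="http://localhost:9200",            # assumed test cluster
    model_id="it-decision-tree-classifier",       # placeholder model ID
    model=local_model,
    feature_names=["f0", "f1", "f2", "f3", "f4"],
)

# Verify Elasticsearch returns the same predictions (within some epsilon)
np.testing.assert_allclose(local_model.predict(X), es_model.predict(X), atol=1e-6)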

Query on existing field and value fails with AttributeError: 'numpy.bool_' object has no attribute 'empty'

Trying to query for an existing value in "winlogbeat-*", getting AttributeError

/usr/local/lib/python3.6/dist-packages/eland/query.py in exists(self, field, must)
     39         """
     40         if must:
---> 41             if self._query.empty():
     42                 self._query = NotNull(field)
     43             else:

AttributeError: 'numpy.bool_' object has no attribute 'empty'

Code:

import eland as ed
import numpy as np
import pandas as pd

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="XYZ", http_auth=('user', 'pwd'))
ed_df = ed.DataFrame(client=es, index_pattern="winlogbeat-*", columns=["@timestamp", "event.code", "user.name"])
ed_df.query('"user.name" == "ADMIN"')

Serializing and ingesting into ES Python data structures with numpy datatypes

Hello eland team! 👋
This is not a bug in eland per se, but I wanted to make a note of it anyway: since eland is meant to be a data-science-native client and uses elasticsearch-py as the low-level transport client, this issue will likely come up in the future.

Serialisation errors with numpy datatypes

When working with data science related datasets, it's common to use numpy to apply various kinds of transformations. As a result, the final data types that will be ingested into ES are likely to contain numpy booleans or int/floats. When we attempt to ingest these with the low-level client, we get a JSON serialisation error


SerializationError: ({'mal_count': 1038, 'be_count': 2080, 'label': True}, TypeError("Unable to serialize True (type: <class 'numpy.bool_'>)"))

I've already discussed this previously with @sethmlarson and one solution would be to implement an override in the JSON serialiser (as this example https://gist.github.com/lribeiro/7d9fbedf830a54685811f63dbbce9464 demonstrates - thanks @honzakral !)

Maybe we could consider handling this out of the box for our users in eland?
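Along the lines of the gist above, a hedged sketch of a serializer override (assuming elasticsearch-py 7.x, where a custom serializer can be passed to the client) that downcasts numpy scalars before JSON encoding:

import numpy as np
from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class NumpyJSONSerializer(JSONSerializer):
    """JSON serializer that converts numpy scalars to native Python types."""

    def default(self, data):
        if isinstance(data, np.bool_):
            return bool(data)
        if isinstance(data, np.integer):
            return int(data)
        if isinstance(data, np.floating):
            return float(data)
        return super().default(data)

es = Elasticsearch("http://localhost:9200", serializer=NumpyJSONSerializer())
es.index(index="counts", body={"mal_count": np.int64(1038), "label": np.bool_(True)})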

Some ES date formats not supported

Current timestamp conversion:

x = pd.to_datetime(x)

pd.to_datetime(x) does not work for some ES date formats. E.g.

"@timestamp" : {
    "type" : "date",
    "format" : "epoch_millis"
 }

@timestamp: 1484053499256 results in 1970-01-01 00:24:44.053499256 rather than 2017-01-10 13:04:59.256000 (which is correct).

Therefore, we need to check ES type and format and add conversions like pd.to_datetime(x, unit='ms')
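For example, with the epoch_millis value above (the mapping/format check itself is the part that needs adding):

import pandas as pd

raw = 1484053499256  # @timestamp stored as epoch_millis

pd.to_datetime(raw)             # wrong: Timestamp('1970-01-01 00:24:44.053499256')
pd.to_datetime(raw, unit='ms')  # right: Timestamp('2017-01-10 13:04:59.256000')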

Difference between pandas.DataFrame.min() and eland.DataFrame.min()

>>> import eland as ed
>>> edf = ed.DataFrame('localhost', 'flights')
>>> pdf = ed.eland_to_pandas(edf)
>>> pdf
       AvgTicketPrice  Cancelled           Carrier                                          Dest DestAirportID  ...                                 OriginLocation OriginRegion        OriginWeather dayOfWeek           timestamp
0          841.265642      False   Kibana Airlines  Sydney Kingsford Smith International Airport           SYD  ...        {'lon': '8.570556', 'lat': '50.033333'}        DE-HE                Sunny         0 2018-01-01 00:00:00
1          882.982662      False  Logstash Airways                     Venice Marco Polo Airport          VE05  ...  {'lon': '18.60169983', 'lat': '-33.96480179'}        SE-BD                Clear         0 2018-01-01 18:27:00
2          190.636904      False  Logstash Airways                     Venice Marco Polo Airport          VE05  ...         {'lon': '12.3519', 'lat': '45.505299'}        IT-34                 Rain         0 2018-01-01 17:11:14
3          181.694216       True   Kibana Airlines                   Treviso-Sant'Angelo Airport          TV01  ...         {'lon': '14.2908', 'lat': '40.886002'}        IT-72  Thunder & Lightning         0 2018-01-01 10:33:28
4          730.041778      False   Kibana Airlines          Xi'an Xianyang International Airport           XIY  ...        {'lon': '-99.072098', 'lat': '19.4363'}       MX-DIF        Damaging Wind         0 2018-01-01 05:13:00
...               ...        ...               ...                                           ...           ...  ...                                            ...          ...                  ...       ...                 ...
13054     1080.446279      False  Logstash Airways          Xi'an Xianyang International Airport           XIY  ...         {'lon': '10.3927', 'lat': '43.683899'}        IT-52                Sunny         6 2018-02-11 20:42:25
13055      646.612941      False  Logstash Airways                                Zurich Airport           ZRH  ...  {'lon': '-97.23989868', 'lat': '49.90999985'}        CA-MB                 Rain         6 2018-02-11 01:41:57
13056      997.751876      False  Logstash Airways                             Ukrainka Air Base          XHBU  ...        {'lon': '-99.072098', 'lat': '19.4363'}       MX-DIF                Sunny         6 2018-02-11 04:09:27
13057     1102.814465      False          JetBeats      Ministro Pistarini International Airport           EZE  ...   {'lon': '135.4380035', 'lat': '34.78549957'}        SE-BD                 Hail         6 2018-02-11 08:28:21
13058      858.144337      False          JetBeats       Washington Dulles International Airport           IAD  ...        {'lon': '138.531006', 'lat': '-34.945'}        SE-BD                 Rain         6 2018-02-11 14:54:34

[13059 rows x 27 columns]
>>> pdf.min()
AvgTicketPrice                                100.021
Cancelled                                       False
Carrier                                        ES-Air
Dest                  Abu Dhabi International Airport
DestAirportID                                     ABQ
DestCityName                                Abu Dhabi
DestCountry                                        AE
DestRegion                                       AR-B
DestWeather                                     Clear
DistanceKilometers                                  0
DistanceMiles                                       0
FlightDelay                                     False
FlightDelayMin                                      0
FlightDelayType                         Carrier Delay
FlightNum                                     00882F6
FlightTimeHour                                      0
FlightTimeMin                                       0
Origin                Abu Dhabi International Airport
OriginAirportID                                   ABQ
OriginCityName                              Abu Dhabi
OriginCountry                                      AE
OriginRegion                                     AR-B
OriginWeather                                   Clear
dayOfWeek                                           0
timestamp                         2018-01-01 00:00:00
dtype: object
>>> edf.min()
AvgTicketPrice        100.020531
Cancelled               0.000000
DistanceKilometers      0.000000
DistanceMiles           0.000000
FlightDelay             0.000000
FlightDelayMin          0.000000
FlightTimeHour          0.000000
FlightTimeMin           0.000000
dayOfWeek               0.000000
dtype: float64
>>> 

Elasticsearch only supports 'min' on numeric + bool + timestamp fields.
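So the gap above is mostly the set of columns each side aggregates; restricting the pandas result to numeric and boolean columns makes the two comparable (a minimal sketch, assuming the same local flights index):

import eland as ed

edf = ed.DataFrame('localhost', 'flights')
pdf = ed.eland_to_pandas(edf)

# Numeric and boolean columns are roughly the set of fields Elasticsearch can
# run 'min' on, so this matches the columns returned by edf.min()
print(pdf.min(numeric_only=True))
print(edf.min())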

UnboundLocalError Thrown When Dataframe Filter Returns 0 Rows

Running calls[calls['call_duration']>8] returns the following error: UnboundLocalError: local variable 'show_dimension_rows' referenced before assignment because there are no calls that have a duration greater than 8 minutes.

From ES:

GET calls/_search
{
  "query": {
    "range": {
      "call_duration": {
        "gte": 8
      }
    }
  }
}

returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Mappings with many fields cause `too_long_frame_exception`

I have an index using an ECS template which has many fields (not sure of the exact count). When building a DataFrame off the index with the resulting mapping from the template, I get the following failure.

It looks like there are too many fields being sent as query params to _field_caps and it is exceeding the limit in the underlying HTTP or elasticsearch-py library.

---------------------------------------------------------------------------
RequestError                              Traceback (most recent call last)
<ipython-input-5-8a0066be88df> in <module>
----> 1 df = ed.read_es('http://elastic:changeme@localhost:9200/', 'ecs-test')

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/utils.py in read_es(es_params, index_pattern)
     36     eland.eland_to_pandas: Create a pandas.Dataframe from eland.DataFrame
     37     """
---> 38     return DataFrame(client=es_params, index_pattern=index_pattern)
     39 
     40 

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/dataframe.py in __init__(self, client, index_pattern, columns, index_field, query_compiler)
    123             columns=columns,
    124             index_field=index_field,
--> 125             query_compiler=query_compiler)
    126 
    127     def _get_columns(self):

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/ndframe.py in __init__(self, client, index_pattern, columns, index_field, query_compiler)
     53                                                 index_pattern=index_pattern,
     54                                                 columns=columns,
---> 55                                                 index_field=index_field)
     56         self._query_compiler = query_compiler
     57 

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/query_compiler.py in __init__(self, client, index_pattern, columns, index_field, operations)
     50         # Get and persist mappings, this allows us to correctly
     51         # map returned types from Elasticsearch to pandas datatypes
---> 52         self._mappings = Mappings(client=self._client, index_pattern=self._index_pattern)
     53 
     54         self._index = Index(self, index_field)

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/mappings.py in __init__(self, client, index_pattern, mappings)
     56             # for these names (fields=* doesn't appear to work effectively...)
     57             all_fields = Mappings._extract_fields_from_mapping(get_mapping)
---> 58             all_fields_caps = client.field_caps(index=index_pattern, fields=list(all_fields.keys()))
     59 
     60             # Get top level (not sub-field multifield) mappings

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/eland/client.py in field_caps(self, **kwargs)
     38 
     39     def field_caps(self, **kwargs):
---> 40         return self._es.field_caps(**kwargs)
     41 
     42     def count(self, **kwargs):

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/elasticsearch/client/utils.py in _wrapped(*args, **kwargs)
     82                 if p in kwargs:
     83                     params[p] = kwargs.pop(p)
---> 84             return func(*args, params=params, **kwargs)
     85 
     86         return _wrapped

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/elasticsearch/client/__init__.py in field_caps(self, index, body, params)
   1837         """
   1838         return self.transport.perform_request(
-> 1839             "GET", _make_path(index, "_field_caps"), params=params, body=body
   1840         )

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/elasticsearch/transport.py in perform_request(self, method, url, headers, params, body)
    356                     headers=headers,
    357                     ignore=ignore,
--> 358                     timeout=timeout,
    359                 )
    360 

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py in perform_request(self, method, url, params, body, timeout, ignore, headers)
    255                 method, full_url, url, body, duration, response.status, raw_data
    256             )
--> 257             self._raise_error(response.status, raw_data)
    258 
    259         self.log_request_success(

~/Documents/source/elastic/ml-search/tools/universal/metrics/venv/lib/python3.7/site-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
    180 
    181         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
--> 182             status_code, error_message, additional_info
    183         )
    184 

RequestError: RequestError(400, 'too_long_frame_exception', 'An HTTP line is larger than 4096 bytes.')

Problems with installing eland in a Python 3 virtual environment

Hi team,
I tried to install eland into a Python 3 virtual environment using pip.
My steps were as follows

  1. Clone the eland repo
  2. activate the Python3 virtual environment
  3. pip install -e ./ while you are in the eland repo
  4. fire up a Python 3 shell and do
import eland as ed

The first error you get is this:

Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eland as ed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/camilla/development/machine-learning-data/datasets/golub-leukemia/eland/eland/__init__.py", line 12, in <module>
    from .query_compiler import *
  File "/Users/camilla/development/machine-learning-data/datasets/golub-leukemia/eland/eland/query_compiler.py", line 2, in <module>
    from modin.backends.base.query_compiler import BaseQueryCompiler
ModuleNotFoundError: No module named 'modin'

Trying to correct this by doing

pip install modin

Leads to this

Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eland as ed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/camilla/development/machine-learning-data/datasets/golub-leukemia/eland/eland/__init__.py", line 13, in <module>
    from .plotting import *
  File "/Users/camilla/development/machine-learning-data/datasets/golub-leukemia/eland/eland/plotting.py", line 6, in <module>
    from pandas.plotting._core import (
ImportError: cannot import name '_raise_if_no_mpl' from 'pandas.plotting._core' (/Users/camilla/development/machine-learning-data/datasets/golub-leukemia/env/lib/python3.7/site-packages/pandas/plotting/_core.py)

There are a few things to improve in the user experience here:

  1. If I recall correctly, the normal pip workflow installs all of the project's dependencies too, so they don't have to be installed separately
  2. Fixing the bugs above

Increase dynamic script compilations limit to avoid test failures on CI

We have some sporadic failures on CI due to the max script compilations limit:

11:01:49 eland/tests/series/test_arithmetics_pytest.py:178: 
11:01:49 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:01:49 eland/tests/common.py:93: in assert_pandas_eland_series_equal
11:01:49     assert_series_equal(left, right._to_pandas(), check_less_precise=check_less_precise)
11:01:49 eland/series.py:363: in _to_pandas
11:01:49     return self._query_compiler.to_pandas()[self.name]
11:01:49 eland/query_compiler.py:388: in to_pandas
11:01:49     return self._operations.to_pandas(self)
11:01:49 eland/operations.py:516: in to_pandas
11:01:49     self._es_results(query_compiler, collector)
11:01:49 eland/operations.py:579: in _es_results
11:01:49     _source=field_names)
11:01:49 eland/client.py:37: in search
11:01:49     return self._es.search(**kwargs)
11:01:49 /usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py:84: in _wrapped
11:01:49     return func(*args, params=params, **kwargs)
11:01:49 /usr/local/lib/python3.7/site-packages/elasticsearch/client/__init__.py:811: in search
11:01:49     "GET", _make_path(index, "_search"), params=params, body=body
11:01:49 /usr/local/lib/python3.7/site-packages/elasticsearch/transport.py:358: in perform_request
11:01:49     timeout=timeout,
11:01:49 /usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py:257: in perform_request
11:01:49     self._raise_error(response.status, raw_data)
11:01:49 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:01:49 
11:01:49 self = <Urllib3HttpConnection: http://instance:9200>, status_code = 500
11:01:49 raw_data = '{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[script] Too many dynamic script compilations ... the [script.max_compilations_rate] setting","bytes_wanted":0,"bytes_limit":0,"durability":"TRANSIENT"}},"status":500}

Martijn's suggestion is to increase the max_compilations_rate

https://elastic.slack.com/archives/GLJQ32A6N/p1575284900197600
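A minimal sketch of applying that suggestion on the test cluster via the cluster settings API (assuming elasticsearch-py 7.x; the rate value is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed CI test cluster

# Raise the dynamic script compilation limit referenced in the error above
es.cluster.put_settings(body={"transient": {"script.max_compilations_rate": "1000/5m"}})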

Indices with lots of fields (e.g. beats indices) cause too_long_frame_exception

I am trying to use eland with winlogbeat, which has 800+ fields. I don't really want to specify which fields I am interested in looking at, since the use case is exploratory analysis.

This is the code I am trying to run:

import eland as ed
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="XYZ", http_auth=('user','passwsd'))
ed_df = ed.DataFrame(client=es, index_pattern="winlogbeat-*")
ed_df.tail()

Error:

Elasticsearch error: {'index': 'winlogbeat-*', 'size': 5, 'sort': '_doc:desc', 'body': {}, '_source': ['@timestamp', 'agent.ephemeral_id', 'agent.hostname', 'agent.id', 'agent.name', 'agent.type', 'agent.version', 'as.number', 'as.organization.name', 'blocked-user-policy.username', 'blocked-user-policy.username_blocked', 'client.address', 'client.as.number', 'client.as.organization.name', 'client.bytes', 'client.domain', 'client.geo.city_name', 'client.geo.continent_name', 'client.geo.country_iso_code', 'client.geo.country_name', 'client.geo.location', 'client.geo.name', 'client.geo.region_iso_code', 'client.geo.region_name', 'client.ip', 'client.mac', 'client.nat.ip', 'client.nat.port', 'client.packets', 'client.port', 'client.registered_domain', 'client.top_level_domain', 'client.user.domain', 'client.user.email', 'client.user.full_name', 'client.user.group.domain', 'client.user.group.id', 'client.user.group.name', 'client.user.hash', 'client.user.id', 'client.user.name', 'cloud.account.id', 'cloud.availability_zone', 'cloud.image.id', 'cloud.instance.id', 'cloud.instance.name', 'cloud.machine.type', 'cloud.project.id', 'cloud.provider', 'cloud.region', 'container.id', 'container.image.name', 'container.image.tag', 'container.name', 'container.runtime', 'destination.address', 'destination.as.number', 'destination.as.organization.name', 'destination.bytes', 'destination.domain', 'destination.geo.city_name', 'destination.geo.continent_name', 'destination.geo.country_iso_code', 'destination.geo.country_name', 'destination.geo.location', 'destination.geo.name', 'destination.geo.region_iso_code', 'destination.geo.region_name', 'destination.ip', 'destination.mac', 'destination.nat.ip', 'destination.nat.port', 'destination.packets', 'destination.port', 'destination.registered_domain', 'destination.top_level_domain', 'destination.user.domain', 'destination.user.email', 'destination.user.full_name', 'destination.user.group.domain', 'destination.user.group.id', 'destination.user.group.name', 'destination.user.hash', 'destination.user.id', 'destination.user.name', 'dns.answers.class', 'dns.answers.data', 'dns.answers.name', 'dns.answers.ttl', 'dns.answers.type', 'dns.header_flags', 'dns.id', 'dns.op_code', 'dns.question.class', 'dns.question.name', 'dns.question.registered_domain', 'dns.question.subdomain', 'dns.question.top_level_domain', 'dns.question.type', 'dns.resolved_ip', 'dns.response_code', 'dns.type', 'ecs.version', 'error.code', 'error.id', 'error.message', 'error.stack_trace', 'error.type', 'event.action', 'event.category', 'event.code', 'event.created', 'event.dataset', 'event.duration', 'event.end', 'event.hash', 'event.id', 'event.ingested', 'event.kind', 'event.module', 'event.original', 'event.outcome', 'event.provider', 'event.risk_score', 'event.risk_score_norm', 'event.sequence', 'event.severity', 'event.start', 'event.timezone', 'event.type', 'file.accessed', 'file.attributes', 'file.created', 'file.ctime', 'file.device', 'file.directory', 'file.drive_letter', 'file.extension', 'file.gid', 'file.group', 'file.hash.md5', 'file.hash.sha1', 'file.hash.sha256', 'file.hash.sha512', 'file.inode', 'file.mode', 'file.mtime', 'file.name', 'file.owner', 'file.path', 'file.size', 'file.target_path', 'file.type', 'file.uid', 'geo.city_name', 'geo.continent_name', 'geo.country_iso_code', 'geo.country_name', 'geo.location', 'geo.name', 'geo.region_iso_code', 'geo.region_name', 'group.domain', 'group.id', 'group.name', 'hash.imphash', 'hash.md5', 'hash.sha1', 'hash.sha256', 'hash.sha512', 
'host.architecture', 'host.containerized', 'host.domain', 'host.geo.city_name', 'host.geo.continent_name', 'host.geo.country_iso_code', 'host.geo.country_name', 'host.geo.location', 'host.geo.name', 'host.geo.region_iso_code', 'host.geo.region_name', 'host.hostname', 'host.id', 'host.ip', 'host.mac', 'host.name', 'host.os.build', 'host.os.codename', 'host.os.family', 'host.os.full', 'host.os.kernel', 'host.os.name', 'host.os.platform', 'host.os.version', 'host.type', 'host.uptime', 'host.user.domain', 'host.user.email', 'host.user.full_name', 'host.user.group.domain', 'host.user.group.id', 'host.user.group.name', 'host.user.hash', 'host.user.id', 'host.user.name', 'http.request.body.bytes', 'http.request.body.content', 'http.request.bytes', 'http.request.method', 'http.request.referrer', 'http.response.body.bytes', 'http.response.body.content', 'http.response.bytes', 'http.response.status_code', 'http.version', 'jolokia.agent.id', 'jolokia.agent.version', 'jolokia.secured', 'jolokia.server.product', 'jolokia.server.vendor', 'jolokia.server.version', 'jolokia.url', 'kubernetes.container.image', 'kubernetes.container.name', 'kubernetes.deployment.name', 'kubernetes.namespace', 'kubernetes.node.name', 'kubernetes.pod.name', 'kubernetes.pod.uid', 'kubernetes.replicaset.name', 'kubernetes.statefulset.name', 'log.file.path', 'log.level', 'log.logger', 'log.origin.file.line', 'log.origin.file.name', 'log.origin.function', 'log.original', 'log.syslog.facility.code', 'log.syslog.facility.name', 'log.syslog.priority', 'log.syslog.severity.code', 'log.syslog.severity.name', 'message', 'mitre.technique.id', 'mitre.technique.name', 'network.application', 'network.bytes', 'network.community_id', 'network.direction', 'network.forwarded_ip', 'network.iana_number', 'network.name', 'network.packets', 'network.protocol', 'network.transport', 'network.type', 'observer.geo.city_name', 'observer.geo.continent_name', 'observer.geo.country_iso_code', 'observer.geo.country_name', 'observer.geo.location', 'observer.geo.name', 'observer.geo.region_iso_code', 'observer.geo.region_name', 'observer.hostname', 'observer.ip', 'observer.mac', 'observer.name', 'observer.os.family', 'observer.os.full', 'observer.os.kernel', 'observer.os.name', 'observer.os.platform', 'observer.os.version', 'observer.product', 'observer.serial_number', 'observer.type', 'observer.vendor', 'observer.version', 'organization.id', 'organization.name', 'os.family', 'os.full', 'os.kernel', 'os.name', 'os.platform', 'os.version', 'package.architecture', 'package.build_version', 'package.checksum', 'package.description', 'package.install_scope', 'package.installed', 'package.license', 'package.name', 'package.path', 'package.reference', 'package.size', 'package.type', 'package.version', 'policy.username', 'policy.username_blocked', 'process.args', 'process.args_count', 'process.command_line', 'process.entity_id', 'process.executable', 'process.exit_code', 'process.hash.md5', 'process.hash.sha1', 'process.hash.sha256', 'process.hash.sha512', 'process.name', 'process.parent.args', 'process.parent.args_count', 'process.parent.command_line', 'process.parent.entity_id', 'process.parent.executable', 'process.parent.exit_code', 'process.parent.name', 'process.parent.pgid', 'process.parent.pid', 'process.parent.ppid', 'process.parent.start', 'process.parent.thread.id', 'process.parent.thread.name', 'process.parent.title', 'process.parent.uptime', 'process.parent.working_directory', 'process.pgid', 'process.pid', 'process.ppid', 'process.start', 
'process.thread.id', 'process.thread.name', 'process.title', 'process.uptime', 'process.working_directory', 'registry.data.bytes', 'registry.data.strings', 'registry.data.type', 'registry.hive', 'registry.key', 'registry.path', 'registry.value', 'related.ip', 'related.user', 'rule.category', 'rule.description', 'rule.id', 'rule.name', 'rule.reference', 'rule.ruleset', 'rule.uuid', 'rule.version', 'server.address', 'server.as.number', 'server.as.organization.name', 'server.bytes', 'server.domain', 'server.geo.city_name', 'server.geo.continent_name', 'server.geo.country_iso_code', 'server.geo.country_name', 'server.geo.location', 'server.geo.name', 'server.geo.region_iso_code', 'server.geo.region_name', 'server.ip', 'server.mac', 'server.nat.ip', 'server.nat.port', 'server.packets', 'server.port', 'server.registered_domain', 'server.top_level_domain', 'server.user.domain', 'server.user.email', 'server.user.full_name', 'server.user.group.domain', 'server.user.group.id', 'server.user.group.name', 'server.user.hash', 'server.user.id', 'server.user.name', 'service.ephemeral_id', 'service.id', 'service.name', 'service.node.name', 'service.state', 'service.type', 'service.version', 'source.address', 'source.as.number', 'source.as.organization.name', 'source.bytes', 'source.domain', 'source.geo.city_name', 'source.geo.continent_name', 'source.geo.country_iso_code', 'source.geo.country_name', 'source.geo.location', 'source.geo.name', 'source.geo.region_iso_code', 'source.geo.region_name', 'source.ip', 'source.mac', 'source.nat.ip', 'source.nat.port', 'source.packets', 'source.port', 'source.registered_domain', 'source.top_level_domain', 'source.user.domain', 'source.user.email', 'source.user.full_name', 'source.user.group.domain', 'source.user.group.id', 'source.user.group.name', 'source.user.hash', 'source.user.id', 'source.user.name', 'sysmon.dns.status', 'tags', 'threat.framework', 'threat.tactic.id', 'threat.tactic.name', 'threat.tactic.reference', 'threat.technique.id', 'threat.technique.name', 'threat.technique.reference', 'timeseries.instance', 'tls.cipher', 'tls.client.certificate', 'tls.client.certificate_chain', 'tls.client.hash.md5', 'tls.client.hash.sha1', 'tls.client.hash.sha256', 'tls.client.issuer', 'tls.client.ja3', 'tls.client.not_after', 'tls.client.not_before', 'tls.client.server_name', 'tls.client.subject', 'tls.client.supported_ciphers', 'tls.curve', 'tls.established', 'tls.next_protocol', 'tls.resumed', 'tls.server.certificate', 'tls.server.certificate_chain', 'tls.server.hash.md5', 'tls.server.hash.sha1', 'tls.server.hash.sha256', 'tls.server.issuer', 'tls.server.ja3s', 'tls.server.not_after', 'tls.server.not_before', 'tls.server.subject', 'tls.version', 'tls.version_protocol', 'tracing.trace.id', 'tracing.transaction.id', 'url.domain', 'url.extension', 'url.fragment', 'url.full', 'url.original', 'url.password', 'url.path', 'url.port', 'url.query', 'url.registered_domain', 'url.scheme', 'url.top_level_domain', 'url.username', 'user.domain', 'user.email', 'user.full_name', 'user.group.domain', 'user.group.id', 'user.group.name', 'user.hash', 'user.id', 'user.name', 'user_agent.device.name', 'user_agent.name', 'user_agent.original', 'user_agent.os.family', 'user_agent.os.full', 'user_agent.os.kernel', 'user_agent.os.name', 'user_agent.os.platform', 'user_agent.os.version', 'user_agent.version', 'vulnerability.category', 'vulnerability.classification', 'vulnerability.description', 'vulnerability.enumeration', 'vulnerability.id', 'vulnerability.reference', 
'vulnerability.report_id', 'vulnerability.scanner.vendor', 'vulnerability.score.base', 'vulnerability.score.environmental', 'vulnerability.score.temporal', 'vulnerability.score.version', 'vulnerability.severity', 'winlog.activity_id', 'winlog.api', 'winlog.channel', 'winlog.computer_name', 'winlog.event_data.AccessList', 'winlog.event_data.AccessMask', 'winlog.event_data.AccountExpires', 'winlog.event_data.AccountName', 'winlog.event_data.AdditionalInfo', 'winlog.event_data.Address', 'winlog.event_data.AddressLength', 'winlog.event_data.AdvancedOptions', 'winlog.event_data.AlertDesc', 'winlog.event_data.AlgorithmName', 'winlog.event_data.AllowedToDelegateTo', 'winlog.event_data.AuthenticationPackageName', 'winlog.event_data.Binary', 'winlog.event_data.BitlockerUserInputTime', 'winlog.event_data.BootAppStatus', 'winlog.event_data.BootMenuPolicy', 'winlog.event_data.BootMode', 'winlog.event_data.BootType', 'winlog.event_data.BugcheckCode', 'winlog.event_data.BugcheckParameter1', 'winlog.event_data.BugcheckParameter2', 'winlog.event_data.BugcheckParameter3', 'winlog.event_data.BugcheckParameter4', 'winlog.event_data.BuildVersion', 'winlog.event_data.CallTrace', 'winlog.event_data.CallerProcessId', 'winlog.event_data.CallerProcessName', 'winlog.event_data.Checkpoint', 'winlog.event_data.Company', 'winlog.event_data.ComputerAccountChange', 'winlog.event_data.Config', 'winlog.event_data.ConfigAccessPolicy', 'winlog.event_data.Configuration', 'winlog.event_data.ConfigurationFileHash', 'winlog.event_data.ConnectedStandbyInProgress', 'winlog.event_data.CorruptionActionState', 'winlog.event_data.CreationUtcTime', 'winlog.event_data.CsEntryScenarioInstanceId', 'winlog.event_data.DCName', 'winlog.event_data.Description', 'winlog.event_data.Detail', 'winlog.event_data.Details', 'winlog.event_data.DeviceName', 'winlog.event_data.DeviceNameLength', 'winlog.event_data.DeviceObject', 'winlog.event_data.DeviceTime', 'winlog.event_data.DeviceVersionMajor', 'winlog.event_data.DeviceVersionMinor', 'winlog.event_data.DirtyPages', 'winlog.event_data.DisableIntegrityChecks', 'winlog.event_data.DisplayName', 'winlog.event_data.DnsHostName', 'winlog.event_data.DomainBehaviorVersion', 'winlog.event_data.DomainName', 'winlog.event_data.DomainPolicyChanged', 'winlog.event_data.DomainSid', 'winlog.event_data.DriveName', 'winlog.event_data.DriverName', 'winlog.event_data.DriverNameLength', 'winlog.event_data.Dummy', 'winlog.event_data.DwordVal', 'winlog.event_data.ElevatedToken', 'winlog.event_data.EnableDisableReason', 'winlog.event_data.Endpoint', 'winlog.event_data.EntryCount', 'winlog.event_data.ErrorCode', 'winlog.event_data.ErrorDescription', 'winlog.event_data.ErrorMessage', 'winlog.event_data.ErrorState', 'winlog.event_data.EventType', 'winlog.event_data.ExtensionId', 'winlog.event_data.ExtensionName', 'winlog.event_data.ExtraInfo', 'winlog.event_data.ExtraInfoLength', 'winlog.event_data.ExtraInfoString', 'winlog.event_data.FailureName', 'winlog.event_data.FailureNameLength', 'winlog.event_data.FailureReason', 'winlog.event_data.FilePath', 'winlog.event_data.FileVersion', 'winlog.event_data.FilterID', 'winlog.event_data.FinalStatus', 'winlog.event_data.FlightSigning', 'winlog.event_data.GPOCNName', 'winlog.event_data.GrantedAccess', 'winlog.event_data.Group', 'winlog.event_data.HandleId', 'winlog.event_data.HiveName', 'winlog.event_data.HiveNameLength', 'winlog.event_data.HomeDirectory', 'winlog.event_data.HomePath', 'winlog.event_data.HypervisorDebug', 'winlog.event_data.HypervisorLaunchType', 
'winlog.event_data.HypervisorLoadOptions', 'winlog.event_data.IdleImplementation', 'winlog.event_data.IdleStateCount', 'winlog.event_data.ImagePath', 'winlog.event_data.ImpersonationLevel', 'winlog.event_data.IntegrityLevel', 'winlog.event_data.Interface', 'winlog.event_data.IpAddress', 'winlog.event_data.IpPort', 'winlog.event_data.IsTestConfig', 'winlog.event_data.KernelDebug', 'winlog.event_data.KeyFilePath', 'winlog.event_data.KeyLength', 'winlog.event_data.KeyName', 'winlog.event_data.KeyType', 'winlog.event_data.KeysUpdated', 'winlog.event_data.LastBootGood', 'winlog.event_data.LastShutdownGood', 'winlog.event_data.LmPackageName', 'winlog.event_data.LoadOptions', 'winlog.event_data.LockoutDuration', 'winlog.event_data.LockoutObservationWindow', 'winlog.event_data.LockoutThreshold', 'winlog.event_data.LogonGuid', 'winlog.event_data.LogonHours', 'winlog.event_data.LogonId', 'winlog.event_data.LogonProcessName', 'winlog.event_data.LogonType', 'winlog.event_data.MachineAccountQuota', 'winlog.event_data.MajorVersion', 'winlog.event_data.MandatoryLabel', 'winlog.event_data.MaximumPerformancePercent', 'winlog.event_data.MemberName', 'winlog.event_data.MemberSid', 'winlog.event_data.MinPasswordLength', 'winlog.event_data.MinimumPerformancePercent', 'winlog.event_data.MinimumThrottlePercent', 'winlog.event_data.MinorVersion', 'winlog.event_data.MixedDomainMode', 'winlog.event_data.NewProcessId', 'winlog.event_data.NewProcessName', 'winlog.event_data.NewSchemeGuid', 'winlog.event_data.NewSd', 'winlog.event_data.NewSize', 'winlog.event_data.NewTargetUserName', 'winlog.event_data.NewTime', 'winlog.event_data.NewUACList', 'winlog.event_data.NewUacValue', 'winlog.event_data.NominalFrequency', 'winlog.event_data.Number', 'winlog.event_data.NumberOfGroupPolicyObjects', 'winlog.event_data.OS EditionID', 'winlog.event_data.OS Name', 'winlog.event_data.OS build version', 'winlog.event_data.OS major version', 'winlog.event_data.OS minor version', 'winlog.event_data.OS service pack major version', 'winlog.event_data.OS service pack minor version', 'winlog.event_data.ObjectName', 'winlog.event_data.ObjectServer', 'winlog.event_data.ObjectType', 'winlog.event_data.OemInformation', 'winlog.event_data.OldSchemeGuid', 'winlog.event_data.OldTargetUserName', 'winlog.event_data.OldTime', 'winlog.event_data.OldUacValue', 'winlog.event_data.Operation', 'winlog.event_data.OperationType', 'winlog.event_data.OriginalFileName', 'winlog.event_data.OriginalSize', 'winlog.event_data.PackageName', 'winlog.event_data.PasswordHistoryLength', 'winlog.event_data.PasswordLastSet', 'winlog.event_data.PasswordProperties', 'winlog.event_data.Path', 'winlog.event_data.PerformanceImplementation', 'winlog.event_data.PowerButtonTimestamp', 'winlog.event_data.PreAuthType', 'winlog.event_data.PreviousCreationUtcTime', 'winlog.event_data.PreviousTime', 'winlog.event_data.PrimaryGroupId', 'winlog.event_data.PrivilegeList', 'winlog.event_data.ProcessId', 'winlog.event_data.ProcessName', 'winlog.event_data.ProcessPath', 'winlog.event_data.ProcessPid', 'winlog.event_data.ProcessingMode', 'winlog.event_data.ProcessingTimeInMilliseconds', 'winlog.event_data.Product', 'winlog.event_data.ProfilePath', 'winlog.event_data.Properties', 'winlog.event_data.ProtocolType', 'winlog.event_data.ProviderName', 'winlog.event_data.PuaCount', 'winlog.event_data.PuaPolicyId', 'winlog.event_data.QfeVersion', 'winlog.event_data.QueryName', 'winlog.event_data.Reason', 'winlog.event_data.RemoteEventLogging', 'winlog.event_data.RestrictedAdminMode', 
'winlog.event_data.RetryMinutes', 'winlog.event_data.ReturnCode', 'winlog.event_data.RuleName', 'winlog.event_data.SamAccountName', 'winlog.event_data.SchemaVersion', 'winlog.event_data.ScriptBlockText', 'winlog.event_data.ScriptPath', 'winlog.event_data.ServiceName', 'winlog.event_data.ServicePrincipalNames', 'winlog.event_data.ServiceSid', 'winlog.event_data.ServiceType', 'winlog.event_data.ServiceVersion', 'winlog.event_data.ShutdownActionType', 'winlog.event_data.ShutdownEventCode', 'winlog.event_data.ShutdownReason', 'winlog.event_data.SidHistory', 'winlog.event_data.Signature', 'winlog.event_data.SignatureStatus', 'winlog.event_data.Signed', 'winlog.event_data.SleepInProgress', 'winlog.event_data.StartTime', 'winlog.event_data.StartType', 'winlog.event_data.State', 'winlog.event_data.Status', 'winlog.event_data.StopTime', 'winlog.event_data.SubStatus', 'winlog.event_data.SubjectDomainName', 'winlog.event_data.SubjectLogonId', 'winlog.event_data.SubjectUserName', 'winlog.event_data.SubjectUserSid', 'winlog.event_data.SupportInfo1', 'winlog.event_data.SupportInfo2', 'winlog.event_data.SystemSleepTransitionsToOn', 'winlog.event_data.TSId', 'winlog.event_data.TargetDomainName', 'winlog.event_data.TargetImage', 'winlog.event_data.TargetInfo', 'winlog.event_data.TargetLinkedLogonId', 'winlog.event_data.TargetLogonGuid', 'winlog.event_data.TargetLogonId', 'winlog.event_data.TargetObject', 'winlog.event_data.TargetOutboundDomainName', 'winlog.event_data.TargetOutboundUserName', 'winlog.event_data.TargetProcessGUID', 'winlog.event_data.TargetProcessId', 'winlog.event_data.TargetServerName', 'winlog.event_data.TargetSid', 'winlog.event_data.TargetUserName', 'winlog.event_data.TargetUserSid', 'winlog.event_data.TerminalSessionId', 'winlog.event_data.TestSigning', 'winlog.event_data.TicketEncryptionType', 'winlog.event_data.TicketOptions', 'winlog.event_data.TimeProvider', 'winlog.event_data.TimeSource', 'winlog.event_data.TokenElevationType', 'winlog.event_data.TransmittedServices', 'winlog.event_data.UnsynchronizedTimeSeconds', 'winlog.event_data.UpdateType', 'winlog.event_data.Url', 'winlog.event_data.UserAccountControl', 'winlog.event_data.UserParameters', 'winlog.event_data.UserPrincipalName', 'winlog.event_data.UserSid', 'winlog.event_data.UserWorkstations', 'winlog.event_data.Version', 'winlog.event_data.VirtualAccount', 'winlog.event_data.VsmLaunchType', 'winlog.event_data.VsmPolicy', 'winlog.event_data.Workstation', 'winlog.event_data.error', 'winlog.event_data.param1', 'winlog.event_data.param10', 'winlog.event_data.param11', 'winlog.event_data.param12', 'winlog.event_data.param2', 'winlog.event_data.param3', 'winlog.event_data.param4', 'winlog.event_data.param5', 'winlog.event_data.param6', 'winlog.event_data.param7', 'winlog.event_data.param8', 'winlog.event_data.param9', 'winlog.event_data.restarttime', 'winlog.event_data.serviceGuid', 'winlog.event_data.spn1', 'winlog.event_data.spn2', 'winlog.event_data.updateGuid', 'winlog.event_data.updateRevisionNumber', 'winlog.event_data.updateTitle', 'winlog.event_data.updatelist', 'winlog.event_id', 'winlog.keywords', 'winlog.logon.failure.reason', 'winlog.logon.failure.status', 'winlog.logon.failure.sub_status', 'winlog.logon.id', 'winlog.logon.type', 'winlog.opcode', 'winlog.process.pid', 'winlog.process.thread.id', 'winlog.provider_guid', 'winlog.provider_name', 'winlog.record_id', 'winlog.related_activity_id', 'winlog.task', 'winlog.user.domain', 'winlog.user.identifier', 'winlog.user.name', 'winlog.user.type', 
'winlog.user_data.AddServiceStatus', 'winlog.user_data.Channel', 'winlog.user_data.DeviceInstanceID', 'winlog.user_data.DriverDescription', 'winlog.user_data.DriverFileName', 'winlog.user_data.DriverName', 'winlog.user_data.DriverProvider', 'winlog.user_data.DriverVersion', 'winlog.user_data.InstallStatus', 'winlog.user_data.IsDriverOEM', 'winlog.user_data.PrimaryService', 'winlog.user_data.Reason', 'winlog.user_data.RebootOption', 'winlog.user_data.RmSessionId', 'winlog.user_data.ServiceName', 'winlog.user_data.SetupClass', 'winlog.user_data.SubjectDomainName', 'winlog.user_data.SubjectLogonId', 'winlog.user_data.SubjectUserName', 'winlog.user_data.SubjectUserSid', 'winlog.user_data.UTCStartTime', 'winlog.user_data.UpdateService', 'winlog.user_data.UpgradeDevice', 'winlog.user_data.binaryData', 'winlog.user_data.binaryDataSize', 'winlog.user_data.param1', 'winlog.user_data.param2', 'winlog.user_data.xml_name', 'winlog.version']}
---------------------------------------------------------------------------
RequestError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

15 frames
/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
    242 
    243         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
--> 244             status_code, error_message, additional_info
    245         )
    246 

RequestError: RequestError(400, 'too_long_frame_exception', 'An HTTP line is larger than 4096 bytes.')

Installed versions (output of pip install eland):

Requirement already satisfied: eland in /usr/local/lib/python3.6/dist-packages (7.6.0a3)
Requirement already satisfied: elasticsearch>=7.0.5 in /usr/local/lib/python3.6/dist-packages (from eland) (7.6.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from eland) (3.2.0)
Requirement already satisfied: pandas==0.25.3 in /usr/local/lib/python3.6/dist-packages (from eland) (0.25.3)
Requirement already satisfied: urllib3>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from elasticsearch>=7.0.5->eland) (1.24.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->eland) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->eland) (2.4.6)
Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.6/dist-packages (from matplotlib->eland) (1.18.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->eland) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->eland) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas==0.25.3->eland) (2018.9)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib->eland) (46.0.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib->eland) (1.12.0)

Investigate building eland based on modin

Modin is a project that allows a user to "Scale your pandas workflows by changing one line of code".

https://github.com/modin-project/modin
https://modin.readthedocs.io/
https://www.youtube.com/watch?v=-HjLd_3ahCw
https://rise.cs.berkeley.edu/blog/modin-pandas-on-ray-october-2018/

The implementation abstracts the pandas DataFrame and Series APIs from the underlying storage.

Initial investigations into implementing eland show that we require a similar architecture. In fact, several eland implementation details are already similar to modin. For example, our implementation of __repr__ is similar to modin.pandas.BasePandasDataset._build_repr_df.

This issue discusses whether we could create a new modin engine/backend based on Elasticsearch.

Note that this project also studied the most common pandas methods used in Kaggle notebooks to rank which methods to implement first (https://github.com/modin-project/study_kaggle_usage, https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/), which gives us a good target.

Trim output when printing dataframe

Right now, printing an eland DataFrame to the console outputs 61 rows of data:

df.head(30)
...
df.tail(30)

We should follow pandas here and output only the first 5 and last 5 records.
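
For context, a minimal sketch of how a pandas-style truncated repr can be built from only the head and tail rows. This is plain pandas, not eland's implementation, and the helper name is illustrative; eland would fetch just those rows from Elasticsearch rather than the whole index:

import pandas as pd

def truncated_repr(df: pd.DataFrame, num_rows: int = 10) -> str:
    # Small frames are rendered in full.
    if len(df) <= num_rows:
        return repr(df)
    half = num_rows // 2
    # Fetch one extra head row so pandas inserts the "..." truncation marker.
    sample = pd.concat([df.head(half + 1), df.tail(half)])
    with pd.option_context("display.max_rows", num_rows):
        return repr(sample)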

eland compatibility issues with Python 3.5.3

Since Pandas supports Python 3.5.3 and above in the 3.5.x series, we may want to support that in eland too. Running the eland pytest suite against Python 3.5.3 and Pandas 0.25.1 in a Docker container results in numerous failures. I'll post a few example stack traces below; for the full dump of the failures, please take a look at the attached .txt file.

TestDataFrameAggs.test_basic_aggs

=================================== FAILURES ===================================
______________________ TestDataFrameAggs.test_basic_aggs _______________________

self = <eland.tests.dataframe.test_aggs_pytest.TestDataFrameAggs object at 0x7f61b46a2898>

    def test_basic_aggs(self):
        pd_flights = self.pd_flights()
        ed_flights = self.ed_flights()
    
        pd_sum_min = pd_flights.select_dtypes(include=[np.number]).agg(['sum', 'min'])
>       ed_sum_min = ed_flights.select_dtypes(include=[np.number]).agg(['sum', 'min'])

eland/tests/dataframe/test_aggs_pytest.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
eland/dataframe.py:974: in aggregate
    return self._query_compiler.aggs(func)
eland/query_compiler.py:393: in aggs
    return self._operations.aggs(self, func)
eland/operations.py:419: in aggs
    body=body.to_search_body())
eland/client.py:37: in search
    return self._es.search(**kwargs)
/usr/local/lib/python3.5/site-packages/elasticsearch/client/utils.py:84: in _wrapped
    return func(*args, params=params, **kwargs)
/usr/local/lib/python3.5/site-packages/elasticsearch/client/__init__.py:811: in search
    "GET", _make_path(index, "_search"), params=params, body=body
/usr/local/lib/python3.5/site-packages/elasticsearch/transport.py:358: in perform_request
    timeout=timeout,
/usr/local/lib/python3.5/site-packages/elasticsearch/connection/http_urllib3.py:257: in perform_request
    self._raise_error(response.status, raw_data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Urllib3HttpConnection: http://instance:9200>, status_code = 400
raw_data = '{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Expected numeric type on field [DestRegion], b...illegal_argument_exception","reason":"Expected numeric type on field [DestRegion], but got [keyword]"}}},"status":400}'

    def _raise_error(self, status_code, raw_data):
        """ Locate appropriate exception and raise it. """
        error_message = raw_data
        additional_info = None
        try:
            if raw_data:
                additional_info = json.loads(raw_data)
                error_message = additional_info.get("error", error_message)
                if isinstance(error_message, dict) and "type" in error_message:
                    error_message = error_message["type"]
        except (ValueError, TypeError) as err:
            logger.warning("Undecodable raw error response from server: %s", err)
    
        raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
>           status_code, error_message, additional_info
        )
E       elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'Expected numeric type on field [DestRegion], but got [keyword]')

/usr/local/lib/python3.5/site-packages/elasticsearch/connection/base.py:182: RequestError
------------------------------ Captured log call -------------------------------
WARNING  elasticsearch:base.py:150 GET http://instance:9200/flights/_search?size=0 [status:400 request:0.162s]

TestDataFrameCount.test_ecommerce_count

___________________ TestDataFrameCount.test_ecommerce_count ____________________

self = <eland.tests.dataframe.test_count_pytest.TestDataFrameCount object at 0x7f61b41a79b0>

    def test_ecommerce_count(self):
        pd_ecommerce = self.pd_ecommerce()
        ed_ecommerce = self.ed_ecommerce()
    
        pd_count = pd_ecommerce.count()
        ed_count = ed_ecommerce.count()
    
>       assert_series_equal(pd_count, ed_count)

eland/tests/dataframe/test_count_pytest.py:18: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: Series.index are different
E   
E   Series.index values are different (100.0 %)
E   [left]:  Index(['category', 'currency', 'customer_birth_date', 'customer_first_name', 'customer_full_name',
E          'customer_gender', 'customer_id', 'customer_last_name', 'customer_phone', 'day_of_week',
E          'day_of_week_i', 'email', 'geoip.city_name', 'geoip.continent_name',
E          'geoip.country_iso_code', 'geoip.location', 'geoip.region_name', 'manufacturer',
E          'order_date', 'order_id', 'products._id', 'products.base_price', 'products.base_unit_price',
E          'products.category', 'products.created_on', 'products.discount_amount',
E          'products.discount_percentage', 'products.manufacturer', 'products.min_price',
E          'products.price', 'products.product_id', 'products.product_name', 'products.quantity',
E          'products.sku', 'products.tax_amount', 'products.taxful_price', 'products.taxless_price',
E          'products.unit_discount_amount', 'sku', 'taxful_total_price', 'taxless_total_price',
E          'total_quantity', 'total_unique_products', 'type', 'user'],
E         dtype='object')
E   [right]: Index(['geoip.region_name', 'total_unique_products', 'user', 'day_of_week', 'customer_birth_date',
E          'products.created_on', 'total_quantity', 'day_of_week_i', 'products.product_name',
E          'products.quantity', 'products.category', 'geoip.city_name', 'order_id',
E          'products.taxful_price', 'products.discount_amount', 'products._id', 'customer_id',
E          'customer_phone', 'products.product_id', 'products.taxless_price', 'geoip.country_iso_code',
E          'customer_full_name', 'customer_gender', 'taxful_total_price', 'category',
E          'taxless_total_price', 'products.tax_amount', 'products.base_price', 'sku',
E          'products.discount_percentage', 'email', 'manufacturer', 'products.price',
E          'products.min_price', 'order_date', 'geoip.location', 'currency', 'type', 'products.sku',
E          'products.unit_discount_amount', 'customer_first_name', 'products.base_unit_price',
E          'products.manufacturer', 'customer_last_name', 'geoip.continent_name'],
E         dtype='object')

pandas/_libs/testing.pyx:178: AssertionError

TestDataFrameDateTime.test_datetime_to_ms

__________________ TestDataFrameDateTime.test_datetime_to_ms ___________________

self = <eland.tests.dataframe.test_datetime_pytest.TestDataFrameDateTime object at 0x7f61b1d50438>

    def test_datetime_to_ms(self):
        df = pd.DataFrame(data={'A': np.random.rand(3),
                                'B': 1,
                                'C': 'foo',
                                'D': pd.Timestamp('20190102'),
                                'E': [1.0, 2.0, 3.0],
                                'F': False,
                                'G': [1, 2, 3]},
                          index=['0', '1', '2'])
    
        expected_mappings = {'mappings': {
            'properties': {'A': {'type': 'double'},
                           'B': {'type': 'long'},
                           'C': {'type': 'keyword'},
                           'D': {'type': 'date'},
                           'E': {'type': 'double'},
                           'F': {'type': 'boolean'},
                           'G': {'type': 'long'}}}}
    
        mappings = ed.Mappings._generate_es_mappings(df)
    
        assert expected_mappings == mappings
    
        # Now create index
        index_name = 'eland_test_generate_es_mappings'
    
        ed_df = ed.pandas_to_eland(df, ELASTICSEARCH_HOST, index_name, if_exists="replace", refresh=True)
        ed_df_head = ed_df.head()
    
>       assert_pandas_eland_frame_equal(df, ed_df_head)

eland/tests/dataframe/test_datetime_pytest.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
eland/tests/common.py:67: in assert_pandas_eland_frame_equal
    assert_frame_equal(left, right._to_pandas())
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: DataFrame.columns are different
E   
E   DataFrame.columns values are different (85.71429 %)
E   [left]:  Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
E   [right]: Index(['G', 'F', 'E', 'D', 'C', 'B', 'A'], dtype='object')

pandas/_libs/testing.pyx:178: AssertionError

Add support for arithmetic operations on string Series

In pandas, str + str and str * int are supported:

>>> df = pd.DataFrame(data={'col1': [1,2], 'col2': ['a','b'], 'col3': ['c', 'd']})
>>> df.dtypes
col1     int64
col2    object
col3    object
dtype: object
>>> df
   col1 col2 col3
0     1    a    c
1     2    b    d
>>> df.col1 * df.col2
0     a
1    bb
dtype: object
>>> df.col2 * df.col1
0     a
1    bb
dtype: object
>>> df.col2 + df.col3
0    ac
1    bd
dtype: object

In eland, arithmetic operations on strings raise a TypeError. TODO: add compatibility with pandas.

See eland.series.Series._numeric_op for details, e.g. __add__:

if not (
    method_name == '__add__' and
    self._dtype.kind in {'O', 'U', 'S'} and
    right._dtype.kind in {'O', 'U', 'S'}
):
    ...
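
As a rough, hedged sketch (not eland's implementation), string concatenation could in principle be pushed down to Elasticsearch as a Painless scripted field. The field names come from the pandas example above; the .keyword sub-field and the index name are assumptions:

# Hypothetical search body computing col2 + col3 server-side via Painless
body = {
    "query": {"match_all": {}},
    "script_fields": {
        "col2_plus_col3": {
            "script": {
                "lang": "painless",
                # Painless supports '+' for string concatenation
                "source": "doc['col2.keyword'].value + doc['col3.keyword'].value",
            }
        }
    },
}
# es.search(index="my-index", body=body)  # 'es' and the index name are illustrative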

Boolean indexing fails

I have a small generated dataset from bin/simulate in https://github.com/elastic/ml-search/tree/master/tools/universal/metrics. When I attempt to use a boolean filter on event.action, the query fails with a very long stack trace that I won't include, but it ends with:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Checking info_es() also looked fine and this same filter works on other fields like event.category.

{'term': {'event.action': 'query'}}
index_pattern: ecs-test
Index:
 index_field: _id
 is_source_field: False
Mappings:
 capabilities:                           _source es_dtype        pd_dtype  searchable  \
@timestamp                   True     date  datetime64[ns]        True   
ecs.version                  True  keyword          object        True   
event.action                 True  keyword          object        True   
event.category               True  keyword          object        True   
event.created                True     date  datetime64[ns]        True   
event.id                     True  keyword          object        True   
labels.click.result.id       True     long           int64        True   
labels.click.result.rank     True     long           int64        True   
labels.query.id              True  keyword          object        True   
labels.query.value           True  keyword          object        True   
labels.results               True     long           int64        True   
source.user.id               True  keyword          object        True   

                          aggregatable  
@timestamp                        True  
ecs.version                       True  
event.action                      True  
event.category                    True  
event.created                     True  
event.id                          True  
labels.click.result.id            True  
labels.click.result.rank          True  
labels.query.id                   True  
labels.query.value                True  
labels.results                    True  
source.user.id                    True  
Operations:
 tasks: [('boolean_filter', {'term': {'event.action': 'query'}}), ('head', ('_doc', 5))]
 size: 5
 sort_params: _doc:asc
 columns: None
 post_processing: []

Relevant parts of the notebook looked something like:

import eland as ed
df = ed.read_es('http://elastic:changeme@localhost:9200/', 'ecs-test')

df[df['event.action'] == 'query']

Happy to help with getting the simulated data into an ES instance, but it should be pretty straightforward; just check the -h on the bin/simulate script.

Next steps priorities - thoughts

Here are some features that have been discussed as priorities. Please raise more:

  1. Setting a DateTimeIndex and then filtering and plotting a DataFrame with the x-axis as time (potentially with smart sampling of data for large datasets). E.g.
ed_df = ed.DataFrame('localhost', 'flights')
ed_df.set_index('timestamp', inplace=True)
ed_df['2018-01-01':'2018-01-11'].plot()


  2. Implementing DataFrame.groupby and calling ES transform as required (see the aggregation sketch after this list):
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
  3. Transforming a DataFrame field inside a DataFrame e.g.

ed_df['total_price'] = ed_df['total_price']*10

  4. Running ML jobs from eland (anomaly detection + DFA)
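
As a rough illustration of item 2, a hedged sketch of the kind of aggregation body that df.groupby(['Animal']).mean() could translate to; the field names are taken from the pandas example above, and the client and index name are assumptions:

# Terms bucket per Animal value, with an avg metric per bucket
body = {
    "size": 0,
    "aggs": {
        "groupby_animal": {
            "terms": {"field": "Animal"},
            "aggs": {"mean_max_speed": {"avg": {"field": "Max Speed"}}},
        }
    },
}
# resp = es.search(index="animals", body=body)  # illustrative client/index
# Each bucket (Falcon, Parrot) would carry its avg sub-aggregation value.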

Refactor module names to comply with PEP8

Hi folks - we should probably commit to following PEP8 (or another style guide) as early as possible.
The module names (Client.py etc.) should be renamed to be all lowercase.

From the PEP8 specification

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

When an extension module written in C or C++ has an accompanying Python module that provides a higher level (e.g. more object oriented) interface, the C/C++ module has a leading underscore (e.g. _socket).

Add support for value counts

I would like to return a Series that gives the frequency distribution of a Series with dtype object. The pandas method is Series.value_counts() or df[col].value_counts(). There is an optional boolean parameter normalize that returns the normalized frequencies instead of the counts.
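
For reference, this is the pandas behaviour (plain pandas, run locally) the issue asks eland to mirror:

import pandas as pd

s = pd.Series(["a", "b", "a", "a"])
print(s.value_counts())                 # a: 3, b: 1
print(s.value_counts(normalize=True))   # a: 0.75, b: 0.25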

Fix xpack testsuite failures on CI

[more info soon]

During handling of the above exception, another exception occurred:
eland/tests/series/test_value_counts_pytest.py:5: in <module>
    from eland.tests.common import TestData
eland/tests/common.py:20: in <module>
    _ed_flights = ed.read_es(ELASTICSEARCH_HOST, FLIGHTS_INDEX_NAME)
eland/utils.py:37: in read_es
    return DataFrame(client=es_params, index_pattern=index_pattern)
eland/dataframe.py:119: in __init__
    query_compiler=query_compiler)
eland/ndframe.py:53: in __init__
    index_field=index_field)
eland/query_compiler.py:46: in __init__
    self._mappings = Mappings(client=self._client, index_pattern=self._index_pattern)
eland/mappings.py:54: in __init__
    get_mapping = client.get_mapping(index=index_pattern)
eland/client.py:28: in get_mapping
    return self._es.indices.get_mapping(**kwargs)
/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py:84: in _wrapped
    return func(*args, params=params, **kwargs)
/usr/local/lib/python3.7/site-packages/elasticsearch/client/indices.py:389: in get_mapping
    "GET", _make_path(index, "_mapping", doc_type), params=params
/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py:358: in perform_request
    timeout=timeout,
/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py:247: in perform_request
    raise SSLError("N/A", str(e), e)
E   elasticsearch.exceptions.SSLError: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)) caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076))
!!!!!!!!!!!!!!!!!!! Interrupted: 36 errors during collection !!!!!!!!!!!!!!!!!!!
============================= 36 errors in 38.22s ==============================

If an ES field has some missing values, eland substitutes _id for these values instead of leaving them empty

Eland version

eland-7.6.0a3
Installed through pip.

Steps to reproduce

  1. Create an index where a field is missing values in some documents - see the example below. Some documents have an empty string for the field BeanType.

[Screenshot: example documents, some with an empty BeanType field]

  2. Read this index into eland:
# establish connection to elasticsearch and read in the dataset
from elasticsearch import Elasticsearch
import eland as ed

es_client = Elasticsearch(host=ES_HOST, port=ES_PORT)
eland_df = ed.DataFrame(client=es_client, index_pattern=INDEX_NAME)
  3. Check the dataset using head() and you will see that the column BeanType is populated with values from the Elasticsearch _id field.

[Screenshot: head() output showing BeanType populated with _id values]

Support unique() on Series

We currently only support nunique(), but it would be nice to support unique() as well so the actual values can be seen. For keyword fields this would presumably just be a terms aggregation.
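
A hedged sketch of the terms aggregation unique() could be built on for keyword fields; the index name, field name, and size are illustrative assumptions (a composite aggregation would be needed to page through high-cardinality fields):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="ecommerce",
    body={
        "size": 0,
        "aggs": {"unique_currency": {"terms": {"field": "currency", "size": 10000}}},
    },
)
# Bucket keys are the distinct values of the field
unique_values = [bucket["key"] for bucket in resp["aggregations"]["unique_currency"]["buckets"]]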

DataFrame metrics aggs not ignoring text fields

Below is an example using DataFrame.max(). The same error occurs with all computational DataFrame methods.

Once this issue is addressed, we should add tests to ensure that aggregations only accept the appropriate data types.

>>> import eland as ed
>>> columns = ['category', 'currency', 'customer_birth_date', 'customer_first_name', 'user']
>>> df = ed.DataFrame('localhost', 'ecommerce', columns=columns)
>>> df.max()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michael/eland/eland/ndframe.py", line 420, in max
    return self._query_compiler.max()
  File "/Users/michael/eland/eland/query_compiler.py", line 409, in max
    return self._operations.max(self)
  File "/Users/michael/eland/eland/operations.py", line 127, in max
    return self._metric_aggs(query_compiler, 'max')
  File "/Users/michael/eland/eland/operations.py", line 157, in _metric_aggs
    body=body.to_search_body())
  File "/Users/michael/eland/eland/client.py", line 37, in search
    return self._es.search(**kwargs)
  File "/Users/michael/eland/venv/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/michael/eland/venv/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 811, in search
    "GET", _make_path(index, "_search"), params=params, body=body
  File "/Users/michael/eland/venv/lib/python3.7/site-packages/elasticsearch/transport.py", line 358, in perform_request
    timeout=timeout,
  File "/Users/michael/eland/venv/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/michael/eland/venv/lib/python3.7/site-packages/elasticsearch/connection/base.py", line 182, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'Fielddata is disabled on text fields by default. Set fielddata=true on [category] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.')
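
A hedged sketch of the kind of guard a fix could add: build the metric aggregation body only from numeric columns and skip text fields entirely. numeric_fields here is an illustrative placeholder (e.g. derived from the index mapping), not an eland attribute:

numeric_fields = ["taxful_total_price", "total_quantity"]  # illustrative
body = {"size": 0, "aggs": {}}
for field in numeric_fields:
    # Text fields like 'category' are skipped instead of triggering fielddata errors
    body["aggs"][field + "_max"] = {"max": {"field": field}}
# es.search(index="ecommerce", body=body)  # illustrative client/index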

Class naming convention

I believe that the parentheses after the class name are unnecessary in Python 3.x.

from elasticsearch import Elasticsearch

class Client():
    """
    eland client - implemented as facade to control access to Elasticsearch methods
    """
    def __init__(self, es=None):
        if isinstance(es, Elasticsearch):
            self.es = es
        else:
            self.es = Elasticsearch(es)
            

Should be something like

class Client:
    # stuff

I believe this might be a vestigial remnant of the old-style vs. new-style classes that existed in earlier versions of Python.
At least looking at the docs for 3.7.4, class names don't have parentheses unless the class inherits from a superclass.

Objects in eland namespace not imported into Jupyter notebook

Steps to reproduce

  1. Create a Python 3 virtual environment
  2. Install Jupyter, then install eland into the same environment using pip
  3. Import eland in a Jupyter cell:
import eland as ed
  4. Try calling any function that is supposed to be in the top-level eland namespace and you will get a NameError
  5. If you use dir() to examine the objects present in the namespace, you'll notice that there are none, which suggests the __init__.py file is not being executed properly.

Things to tidy up in the README.md

Some things to fix in the eland README to make usage easier:

  • Add supported ES versions somewhere so users know if they can safely use this

  • Rename the https://github.com/elastic/eland#development-setup section to "Quickstart" or something similar. "Development Setup" should be a section that's meant for people who want to contribute or make a change to the eland codebase, not for people who just want to use it.

  • For Development Setup, include guidance on how to run tests for eland.

  • Add a Contributing section/doc that outlines the contributor policy (check the elasticsearch-py repo for similar guidelines).

Add progress feedback to long operations

If operations such as pd_df = ed.eland_to_pandas(ed_df) take a long time, then there is no feedback to users.

It would be good to add optional progress feedback to stdout, e.g. ed.eland_to_pandas(ed_df, show_progress=True, progress_interval=10000).
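
A minimal sketch, not eland's API, of what such feedback could look like while pulling documents; the helper name and parameters are illustrative:

import pandas as pd
from elasticsearch.helpers import scan

def eland_to_pandas_with_progress(es, index, progress_interval=10000):
    rows = []
    for i, hit in enumerate(scan(es, index=index), start=1):
        rows.append(hit["_source"])
        if i % progress_interval == 0:
            # Periodic feedback on stdout so long pulls are visible to the user
            print("Read {} documents from '{}'".format(i, index))
    return pd.DataFrame(rows)

# pd_df = eland_to_pandas_with_progress(es_client, "flights")  # illustrative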
