
meerkat's People

Contributors

dbiswas1, diwu001, ffeizhu, hvudumala, jiegzhan, josephaltmaier, nsivasu, oscardpan, rayaankhatau, redpanda-ai, sivanmehta, speakerjohnash, vnagarajy

Forkers

trellixvulnteam

meerkat's Issues

testAccuracy.py produces 11 PyLint errors

You should be able to see these in the "Problems" tab and as yellow yield signs on your editor.

Additionally, testAccuracy.py and mergeResults.py should be renamed to test_accuracy.py and merge_results.py.

Re: tokenize_descriptions output

We need to figure out how to output our results. Some possibilities:

  1. Human readable, for demoing to others (how it is now)
  2. A list of extracted features
  3. The top result with only the top fields listed
  4. The top n results
  5. Just the PERSISTENTRECORDID
  6. Keep the human-readable output, but allow passing in a file to receive any of the above formats

New Feature ideas (please decompose into specific issues).

My thought was to keep the current printing scheme and instead offer a more elaborate set of arguments to pass in. One option could export the bare results to a specific file; another could be a conditional print, where passing a flag prints only the top result.

But realistically, swapping out the prints for logs didn't work that well the first time, and I don't want to do it again. I think it would be better to have a more robust set of arguments to pass with the string, including one that takes a filename to which the top result is appended.
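If we go the argument route, an argparse layout along these lines could cover both cases. This is only a sketch; the flag names are assumptions, not existing options:

```python
import argparse

def parse_args(argv):
    """Hypothetical argument set for tokenize_descriptions."""
    parser = argparse.ArgumentParser(description="Tokenize a transaction description")
    parser.add_argument("input_string", help="the description to tokenize")
    parser.add_argument("--top-only", action="store_true",
                        help="print only the top result")
    parser.add_argument("--append-to", metavar="FILE",
                        help="append the top result to FILE instead of printing everything")
    return parser.parse_args(argv)

args = parse_args(["SOME DESCRIPTION", "--top-only"])
print(args.top_only)  # True
```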

2014-01-03 Brainstorming

  1. Should we consider "feature engineering" or "feature learning" to be the correct path? UNKNOWN
  2. What algorithms are available for "feature learning"? https://en.wikipedia.org/wiki/Feature_learning
  3. What algorithms are available for "feature engineering"? UNKNOWN
  4. What is our timeline for finishing our basic toolset and moving on to:
    a. building a machine learning algorithm (OPEN)
  5. How much time would be spent on research? (ESTIMATED)
  6. How much time would be spent on development? (ESTIMATED)
  7. Are we hoping to build up the system incrementally? YES
  8. How can text be encoded into numerical values? (OPEN)
  9. Can we conduct a git exercise to allow independent feature development that will easily merge? What are the steps involved? More importantly, when are we going to do them? (COMPLETED)

new_bulk_loader branch

Will allow us to efficiently index 25 million records into ElasticSearch.
There will be special consideration for geo coordinate mappings to ensure that we can support search regions bounded by polygons.

Label transaction data

Prerequisite #60

  • Label transaction data (Matt, Andy)
    • At least 10,000 descriptions, to satisfy a 95% confidence interval with a 1% margin of error
    • Labelled as physical or not-physical
    • Labelled with the unique id of the matching Factual merchant
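The 10,000 figure can be sanity-checked with Cochran's sample-size formula; a minimal sketch, assuming the worst-case proportion p = 0.5:

```python
import math

def sample_size(confidence_z=1.96, margin=0.01, p=0.5):
    """Cochran's formula: labels needed to estimate a proportion
    within `margin` at the confidence level given by z."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

n = sample_size()  # 95% confidence, 1% margin, worst-case p
print(n)  # 9604, which rounds up to the 10,000 labels cited above
```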

new factual.com description_consumer and description_producer

  • Requirements
    • Needs to deal with factual.com as it was loaded by the new_bulk_loader module.
    • Will ultimately replace our existing description_consumer and description_producer modules.
    • Will deal with composite features in a generic way involving loading composite definitions from a configuration file.

We need to recruit multiple ES nodes for our tokenizer

  1. Each DescriptionConsumer thread should be assigned its own ES node for all queries.
  2. A list containing at least one ES node must be provided within a configuration file.
  3. If an ES node is not provided, throw an error and abort.
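The three requirements above could look something like the following sketch; the configuration layout and function name are assumptions:

```python
import itertools

def assign_nodes(config, thread_count):
    """Assign one ES node to each DescriptionConsumer thread,
    cycling through the configured list; abort if none is provided."""
    nodes = config.get("elasticsearch", {}).get("nodes", [])
    if not nodes:
        raise SystemExit("Fatal: no Elasticsearch nodes listed in the configuration file")
    cycle = itertools.cycle(nodes)
    return [next(cycle) for _ in range(thread_count)]

assignments = assign_nodes(
    {"elasticsearch": {"nodes": ["es1:9200", "es2:9200"]}}, thread_count=4)
print(assignments)  # ['es1:9200', 'es2:9200', 'es1:9200', 'es2:9200']
```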

Need more verbose error messages on Params load.

I don't know if verbose is the right word. Whatever.

Point is, if loading anything in the config file fails, it just throws a "can't find file" error. This led to some confusion and should be fixed.

"write_output" function fails on bad configuration file.

Please validate the JSON config file before processing begins.

Here are logs demonstrating how processing runs to completion before throwing an Exception, because the "output.file" key was not found in the JSON config file.

STEPS TO REPRODUCE:

  1. Run __init__.py using config/default.json

EXPECTED RESULT:

  1. An informative error message stating that the configuration file must have an "output.file" key and value; the program should then immediately halt without doing any actual work.

ACTUAL RESULT:

2014-01-22 17:21:49,808 - thread 0 - INFO - Log initialized.
Input String  CHECKCARD 0126 ORIGINAL GRAVITY PUBLIC SAN JOSE CA 24690293027080080270199
2014-01-22 17:21:49,808 - thread 0 - INFO - {
    "concurrency": 10,
    "input": {
        "encoding": "utf-8",
        "filename": "../data/short_bank_transaction_descriptions.csv"
    },
    "logging": {
        "console": true,
        "formatter": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        "level": "info",
        "path": "../logs/logs.log"
    },
    "output": {
        "filepath": "../data/longTailLabeled.csv",
        "results": {
            "fields": [
                "BUSINESSSTANDARDNAME",
                "HOUSE",
                "PREDIR",
                "STREET",
                "STRTYPE",
                "CITYNAME",
                "STATE",
                "ZIP",
                "pin.location"
            ],
            "size": 10
        }
    }
}
2014-01-22 17:21:49,816 - thread 1 - INFO - Log initialized.
[threads 1 through 9 each log the same configuration block as thread 0; identical repeats omitted]
Input String  ARCO PAYPOINT 01/22 #000490226 PURCHASE 470 RALSTON AVE BELMONT CA
2014-01-22 17:21:52,967 - thread 1 - INFO - TOKENS ARE: ['ARCO', 'PAYPOINT', '0122', '#0004', '9', '0226', 'PURCHASE', '470', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:52,973 - thread 1 - INFO - Unigrams are:
    ['ARCO', 'PAYPOINT', '0122', '#0004', '9', '0226', 'PURCHASE', '470', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:52,974 - thread 1 - INFO - Unigrams matched to ElasticSearch:
    ['PURCHASE', 'RALSTON', 'BELMONT', '#0004', '0226', '0122', 'ARCO', 'AVE', '470', 'CA']
2014-01-22 17:21:52,974 - thread 1 - INFO - Of these:
2014-01-22 17:21:52,974 - thread 1 - INFO -     2 stop words:      ['PAYPOINT', 'PURCHASE']
2014-01-22 17:21:52,980 - thread 1 - INFO -     0 phone_numbers:   []
2014-01-22 17:21:52,985 - thread 1 - INFO -     5 numeric words:   ['0122', '#0004', '9', '0226', '470']
2014-01-22 17:21:52,991 - thread 1 - INFO -     5 unigrams: ['ARCO', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:53,062 - thread 1 - INFO -     1 addresses: ['470 RALSTON AVE BELMONT CA']
2014-01-22 17:21:53,063 - thread 1 - INFO - Search components are:
2014-01-22 17:21:53,068 - thread 1 - INFO -     Unigrams: 'ARCO RALSTON AVE BELMONT CA'
2014-01-22 17:21:53,069 - thread 1 - INFO -     Matching 'Address': '470 RALSTON AVE BELMONT CA'
2014-01-22 17:21:53,069 - thread 1 - INFO - {"size": 10, "fields": ["BUSINESSSTANDARDNAME", "HOUSE", "PREDIR", "STREET", "STRTYPE", "CITYNAME", "STATE", "ZIP", "pin.location"], "from": 0, "query": {"bool": {"minimum_number_should_match": 1, "should": [{"query_string": {"fields": ["_all^1", "BUSINESSSTANDARDNAME^2"], "query": "ARCO RALSTON AVE BELMONT CA", "boost": 1}}, {"match": {"composite.address^3": {"query": "470 RALSTON AVE BELMONT CA", "boost": 10, "type": "phrase"}}}]}}}
2014-01-22 17:21:53,208 - thread 1 - INFO - This system required 33 individual searches.
2014-01-22 17:21:53,215 - thread 1 - INFO - Z-Score delta: [0.731]
2014-01-22 17:21:53,215 - thread 1 - INFO - Top Score Quality: Low-grade
[0.113] Ralston Florist 932 Ralston Ave Belmont CA 94002 {'lat': '37.519724', 'lon': '-122.276999'}
[0.1] Ralston Florist 936 Ralston Ave Belmont CA 94002 {'lat': '37.519706', 'lon': '-122.27702'}
[0.082] Ralston Village Cleaners 980 Ralston Ave Belmont CA 94002 {'lat': '37.519505', 'lon': '-122.277254'}
[0.081] Ralston Elementary School 2675 Ralston Ave Belmont CA 94002 {'lat': '37.511269', 'lon': '-122.31189'}
[0.074] Belmont Gardens 1100 Ralston Ave Belmont CA 94002 {'lat': '37.517536', 'lon': '-122.278458'}
[0.07] Belmont Optique 877 Ralston Ave Belmont CA 94002 {'lat': '37.51979', 'lon': '-122.276451'}
[0.062] Whipple Arco 504 Whipple Ave Redwood City CA 94063 {'lat': '37.494598', 'lon': '-122.235054'}
[0.061] Universe of Colors of Belmont LLC 887 Ralston Ave Belmont CA 94002 {'lat': '37.519745', 'lon': '-122.276505'}
[0.058] Telegraph Arco 6407 Telegraph Ave Oakland CA 94609 {'lat': '37.850433', 'lon': '-122.260834'}
[0.058] Johns Arco 286 S Livermore Ave Livermore CA 94550 {'lat': '37.681221', 'lon': '-121.766518'}
2014-01-22 17:22:06,945 - thread 0 - INFO - TOKENS ARE: ['CHECKCARD', '0126', 'ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO - Unigrams are:
    ['CHECKCARD', '0126', 'ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO - Unigrams matched to ElasticSearch:
    ['ORIGINAL', 'GRAVITY', 'PUBLIC', '27080', '7019', '6902', 'JOSE', '0126', '0802', 'SAN', '930', 'CA', '24']
2014-01-22 17:22:06,945 - thread 0 - INFO - Of these:
2014-01-22 17:22:06,945 - thread 0 - INFO -     1 stop words:      ['CHECKCARD']
2014-01-22 17:22:06,945 - thread 0 - INFO -     0 phone_numbers:   []
2014-01-22 17:22:06,945 - thread 0 - INFO -     8 numeric words:   ['0126', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO -     6 unigrams: ['ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA']
2014-01-22 17:22:07,294 - thread 0 - INFO -     0 addresses: []
2014-01-22 17:22:07,295 - thread 0 - INFO - Search components are:
2014-01-22 17:22:07,295 - thread 0 - INFO -     Unigrams: 'ORIGINAL GRAVITY PUBLIC SAN JOSE CA'
2014-01-22 17:22:07,295 - thread 0 - INFO - {"size": 10, "fields": ["BUSINESSSTANDARDNAME", "HOUSE", "PREDIR", "STREET", "STRTYPE", "CITYNAME", "STATE", "ZIP", "pin.location"], "from": 0, "query": {"bool": {"minimum_number_should_match": 1, "should": [{"query_string": {"fields": ["_all^1", "BUSINESSSTANDARDNAME^2"], "query": "ORIGINAL GRAVITY PUBLIC SAN JOSE CA", "boost": 1}}]}}}
2014-01-22 17:22:07,402 - thread 0 - INFO - This system required 238 individual searches.
2014-01-22 17:22:07,403 - thread 0 - INFO - Z-Score delta: [3.182]
2014-01-22 17:22:07,403 - thread 0 - INFO - Top Score Quality: High-grade
[6.698] Original Gravity Public House 66 S 1st St San Jose CA 95113 {'lat': '37.335018', 'lon': '-121.889503'}
[1.895] Gravity Mobile 466 8th St San Francisco CA 94103 {'lat': '37.772671', 'lon': '-122.407845'}
[1.816] Leadership Public Schools-San Jose 1881 Cunningham Ave San Jose CA 95122 {'lat': '37.330371', 'lon': '-121.828946'}
[1.789] Gravity Media 2030 Union St San Francisco CA 94123 {'lat': '37.79757', 'lon': '-122.432796'}
[1.788] Gravity People 147 Natoma St San Francisco CA 94105 {'lat': '37.786098', 'lon': '-122.399918'}
[1.779] City of San Jose Public Works 801 N 1st St San Jose CA 95110 {'lat': '37.350967', 'lon': '-121.903526'}
[1.657] Original Buddhism Society In America 1879 Lundy Ave San Jose CA 95131 {'lat': '37.392794', 'lon': '-121.890645'}
[1.527] Original Joes 1704 Union St San Francisco CA 94123 {'lat': '37.798243', 'lon': '-122.427518'}
[1.482] The Original Pancake House 1366 S De Anza Blvd San Jose CA 95129 {'lat': '37.298752', 'lon': '-122.031654'}
[1.475] The Original Pancake House 2306 Almaden Rd San Jose CA 95125 {'lat': '37.292016', 'lon': '-121.880092'}
Traceback (most recent call last):
  File "/Users/jkey/git/longtail/src/__init__.py", line 9, in <module>
    tokenize_descriptions.start()
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 102, in start
    tokenize(params, desc_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 58, in tokenize
    write_output(params, result_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 66, in write_output
    file_name = params["output"]["file"]["path"] or '../data/longtailLabeled.csv'
KeyError: 'file'
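The fix amounts to validating the parameter dictionary before any threads are spawned. A minimal sketch, with key paths assumed from the config shown in the logs above:

```python
def validate_params(params):
    """Fail fast if required config keys are missing, before any work begins."""
    required = ["input.filename", "logging.path", "output.file.path"]
    for path in required:
        node = params
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                raise KeyError("Configuration file is missing required key: " + path)
            node = node[key]

# The config from the logs uses "output.filepath", not "output.file.path",
# so validation catches the mismatch up front instead of after 238 searches.
bad = {"input": {"filename": "x.csv"}, "logging": {"path": "l.log"},
       "output": {"filepath": "y.csv"}}
try:
    validate_params(bad)
except KeyError as err:
    print(err)  # reports the missing "output.file.path" key
```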

Integrate Binary Classifier into project

Binary classifier should be integrated such that it removes non-physical transactions before forwarding them through the merchant labeler.

Should be configurable from config, such that non-physical transactions can be output for processing by another system.

Should probably be a sub-package, such that its parts and pieces are isolated from the rest of the project.
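The filtering stage could be as simple as the sketch below; `classify_fn` and the sink are hypothetical names standing in for the real classifier and the configured output for the other system:

```python
def filter_physical(transactions, classify_fn, non_physical_sink):
    """Route transactions before the merchant labeler: physical ones
    continue down the pipeline, non-physical ones go to the sink."""
    physical = []
    for txn in transactions:
        (physical if classify_fn(txn) else non_physical_sink).append(txn)
    return physical

sink = []
kept = filter_physical(["ARCO STATION BELMONT CA", "NETFLIX.COM"],
                       lambda t: "ARCO" in t, sink)  # toy classifier
print(kept, sink)  # ['ARCO STATION BELMONT CA'] ['NETFLIX.COM']
```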

Improved Accuracy Output

Taken from issue #53
Improved Accuracy Output (Matt)

Not sure what "improved" means. Add something to this issue to ensure that we know what needs to be done and why.

Matt Sevrens, estimates for stories for week starting 2014-01-27

"Feature discovery" requires some thought.

  1. Python3 research - How exactly do imports among modules work?
    35 sp
  2. Python3 research - How exactly should we structure unit tests?
    5 sp
  3. Binary Classifier -
    21 sp
  4. Decision boundary for recall -
    35 sp
    a. Research z-scores (wikipedia, scipy and numpy docs) 21 sp
    b. Writing source code 13 sp
  5. Parameter Vector
    21 sp
    a. Research numpy, especially arrays 13 sp
    b. Writing source code 8 sp
  6. Location estimator, depends on decision boundary
    90 sp
    a. ElasticSearch research - How to use the Geo API? 13 sp
    b. Writing source code 50 sp

Function to get requested fieldnames by PERSISTENTRECORDID

We should easily be able to provide a list of PERSISTENTRECORDIDs and a list of fieldnames, and get back all of the requested fields. This should most likely go into tools, as it seems like a generally reusable and useful function.
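The function could build a terms query like the one below; this is only a sketch, and the query shape mirrors the request bodies in the logs above rather than any existing module:

```python
import json

def build_fields_query(record_ids, fieldnames, size=None):
    """Build an Elasticsearch query body that fetches `fieldnames`
    for every record whose PERSISTENTRECORDID is in `record_ids`."""
    return {
        "size": size or len(record_ids),
        "fields": fieldnames,
        "query": {"terms": {"PERSISTENTRECORDID": record_ids}},
    }

body = build_fields_query(["123", "456"], ["BUSINESSSTANDARDNAME", "ZIP"])
print(json.dumps(body))
```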

"write_output_to_file" function fails when "csv" is provided as the format

Please validate the JSON config file before processing begins.

Here are logs demonstrating how processing runs to completion before throwing an Exception, because a result contained a field ("PREDIR") that is not listed in the configured output fieldnames.

STEPS TO REPRODUCE:

  1. Run __init__.py using config/single_output_csv.json

EXPECTED RESULT:

  1. An output file containing comma-separated values.
  2. No exceptions.

ACTUAL RESULT:

Input String  "CHECKCARD 0120 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0221 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0321 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0420 24HOUR FITNESS USA,INC 800-432-6348 CA..."
...
[1.375] Atria Sunnyvale 175 E 41683957
[0.893] Tlr 25 41524214
[1.375] Atria Sunnyvale 175 E 41683957
[1.375] Atria Sunnyvale 175 E 41683957
[0.846] Monroe The 1870 9165730
Traceback (most recent call last):
  File "/Users/jkey/git/longtail/src/__init__.py", line 9, in <module>
    tokenize_descriptions.start()
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 101, in start
    tokenize(params, desc_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 60, in tokenize
    write_output_to_file(params, result_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 84, in write_output_to_file
    dict_w.writerows(output_list)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 158, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 149, in _dict_to_list
    + ", ".join(wrong_fields))
ValueError: dict contains fields not in fieldnames: PREDIR
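One way to harden "write_output_to_file" is csv.DictWriter's extrasaction='ignore' option, which drops keys missing from fieldnames instead of raising mid-run (validating the configured fieldnames up front would also work). A minimal sketch with made-up row data:

```python
import csv
import io

# A result row carries PREDIR, but the config only asked for two fields.
rows = [{"BUSINESSSTANDARDNAME": "Atria Sunnyvale", "PREDIR": "E", "ZIP": "94087"}]
fieldnames = ["BUSINESSSTANDARDNAME", "ZIP"]

buf = io.StringIO()
# extrasaction='ignore' silently discards keys absent from fieldnames,
# avoiding the ValueError shown in the traceback above.
writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```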

Set up a JSON-formatted configuration file as the only parameter

To handle the large variety of ways we might use our "tokenize_description" library, we should abandon the idea of supplying a variety of command-line arguments in favor of referring to a single configuration file of parameters.

For instance:

{
    "input" : {
        "type" : "local",
        "path" : "../short_bank_transaction_descriptions.csv"
    },
    "logging" : {
        "level" : "INFO",
        "path" : "../logs.log",
        "format" : "%(levelname)s:%(message)s"
    },
    "output" : {
        "type" : "local",
        "path" : "../short_bank_transaction_descriptions_results.csv",
        "formatter" : "name_of_python_function_to_format_output"
    }
}

ElasticSearch 404 issue on the following description strings.

This can be caused by multiple issues, such as a bad character or an empty query string.

Two current transactions causing this:
S AND S TIRE A 02/21 #000484150 PURCHASE 597 S MURPHY AVE SUNNYVALE CA
S AND S TIRE A 12/23 #000607706 PURCHASE S AND S TIRE AN SUNNYVALE CA

In these particular cases it's due to an empty query string. Will have to debug why that occurs as well.
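A guard along these lines would avoid sending the empty query in the first place; the stop-word list here is illustrative only:

```python
STOP_WORDS = frozenset({"AND", "OR", "PURCHASE"})  # illustrative list

def safe_query_string(tokens, stop_words=STOP_WORDS):
    """Drop reserved/stop words and return None when nothing remains,
    so the caller can skip the search instead of sending an empty
    query that ElasticSearch answers with a 404."""
    kept = [t for t in tokens if t.upper() not in stop_words]
    return " ".join(kept) or None

print(safe_query_string(["AND", "OR"]))              # None: skip the search
print(safe_query_string(["S", "AND", "S", "TIRE"]))  # 'S S TIRE'
```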

jkey - week of 2014-01-27

Forced Rank for new features:
#15 Node discovery based upon a single node from within a configuration file 6b788dd
#36 Add n-grams (n >=2) back to the final bool query 2a8a4e3
#37 description_consumer should take advantage of a shared dict of results 25166a2
#35 Reserved words "AND" and "OR" fail when submitted by themselves to __search_index d802ce8

Location boost for ambiguous transactions

Ambiguous transactions should be boosted by unambiguous surrounding transactions. We could do this by:

Having an input file that contains: USER_ID, DESCRIPTION, DATE

The file (sorted by date) would be run against our tokenize_descriptions file, and only results with a score over threshold_score would be labeled. The file would then be run again, but with a distance parameter boost relative to the transactions with the highest score. This could be repeated if it still provides sufficient advantages on each successive run.

Very important to think about in regards to parallelization as transactions don't occur in isolation. This would also reduce the number of searches necessary to get the correct merchant.
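The scheme can be sketched as a two-pass loop; the scoring function and the flat boost are stand-ins for the real search score and distance-based boost:

```python
def two_pass_label(transactions, score_fn, threshold, boost=0.2):
    """Pass 1 keeps only labels whose score clears `threshold`;
    pass 2 re-scores the remainder with a boost when an adjacent
    (date-sorted) transaction was confidently labelled."""
    scores = [score_fn(t) for t in transactions]
    labelled = [s >= threshold for s in scores]
    for i, ok in enumerate(labelled):
        if ok:
            continue
        near = (i > 0 and labelled[i - 1]) or \
               (i + 1 < len(labelled) and labelled[i + 1])
        if near and scores[i] + boost >= threshold:
            labelled[i] = True
    return labelled

# The middle transaction is ambiguous on its own, but sits between
# two confident neighbours, so the boost tips it over the threshold.
print(two_pass_label(["a", "b", "c"],
                     {"a": 0.9, "b": 0.75, "c": 0.95}.get,
                     threshold=0.8))  # [True, True, True]
```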

Process each description string in its own thread within "tokenize_description.py"

There are no dependencies among the description strings in our file, so there is no reason to process them sequentially. Let's add support for:

  1. Creating a synchronized queue of description strings from the input file
  2. Spawning a configurable number of consumer threads
  3. Ensuring consumer threads tokenize the strings, logging data to the same file
  4. Having each thread write its output to another synchronized queue
  5. Piping results to the output file in a non-interleaved way
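The steps above can be sketched with the standard library's queue and threading modules; tokenization is stubbed out with a simple split:

```python
import queue
import threading

def process_descriptions(descriptions, worker_count=3):
    """A synchronized input queue feeds a pool of consumer threads;
    each thread pushes its result onto a second queue that is drained
    once, keeping the output non-interleaved."""
    desc_queue, result_queue = queue.Queue(), queue.Queue()
    for desc in descriptions:
        desc_queue.put(desc)

    def consumer():
        while True:
            try:
                desc = desc_queue.get_nowait()
            except queue.Empty:
                return
            result_queue.put(desc.split())  # stand-in for real tokenizing

    threads = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [result_queue.get() for _ in range(result_queue.qsize())]

results = process_descriptions(["ARCO PAYPOINT", "CHECKCARD 0126"])
print(sorted(results))
```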

ElasticSearch 404 Errors halt execution

When elasticsearch returns with a 404, it breaks execution of tokenize_descriptions.

It should instead log the error, perhaps into a special file, and continue to the next description in the queue.
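A minimal sketch of the desired catch-and-continue behaviour; the exception type and search function are stand-ins for whatever the ES client actually raises:

```python
def search_with_skip(query, search_fn, error_log):
    """Log a failing description and move on, rather than letting
    one bad query halt the whole run."""
    try:
        return search_fn(query)
    except Exception as err:  # e.g. a 404 from ElasticSearch
        error_log.append((query, str(err)))
        return None

def fake_search(q):
    """Toy search that fails on an empty query, like the 404 case."""
    if not q:
        raise RuntimeError("404: empty query string")
    return {"hits": 1}

errors = []
print(search_with_skip("ARCO BELMONT CA", fake_search, errors))  # {'hits': 1}
print(search_with_skip("", fake_search, errors))                 # None
print(len(errors))  # 1: the failure was logged, not fatal
```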

Elasticsearch 404 halts execution of tokenize_descriptions.py

When elasticsearch returns with a 404, it breaks execution of tokenize_descriptions.

This can be caused by multiple issues, such as a bad character or an empty query string.

Two current transactions causing this:
S AND S TIRE A 02/21 #000484150 PURCHASE 597 S MURPHY AVE SUNNYVALE CA
S AND S TIRE A 12/23 #000607706 PURCHASE S AND S TIRE AN SUNNYVALE CA

In these particular cases it's due to an empty query string. Will have to debug why that occurs as well.
