
meerkat's People

Contributors

dbiswas1, diwu001, ffeizhu, hvudumala, jiegzhan, josephaltmaier, nsivasu, oscardpan, rayaankhatau, redpanda-ai, sivanmehta, speakerjohnash, vnagarajy

Forkers

trellixvulnteam

meerkat's Issues

testAccuracy.py produces 11 PyLint errors

You should be able to see these in the "Problems" tab and as yellow yield signs on your editor.

Additionally, testAccuracy.py and mergeResults.py should be renamed to test_accuracy.py and merge_results.py.

Re: tokenize_descriptions output

We need to figure out how to output our results. Some possibilities:

  1. Human readable, for demoing to others (how it is now)
  2. A list of extracted features
  3. The top result with only the top fields listed
  4. The top n results
  5. Just the PERSISTENTRECORDID
  6. Keep the human-readable output, but allow passing in a file to receive any of the above formats

New Feature ideas (please decompose into specific issues).

My thought was to keep the current printing scheme and instead offer a more elaborate set of arguments to pass in. One option could export the bare results to a specific file; another could be a conditional print, where passing a flag prints only the top result.

But realistically, swapping out the prints for logs didn't work that well the first time, and I don't want to do it again. I think it would be better to have a more robust set of arguments to pass with the string, including one that takes a filename to which the top result is appended.
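If we go the argument route, an argparse layout along these lines could cover both cases. This is only a sketch; the flag names are assumptions, not existing options:

```python
import argparse

def parse_args(argv):
    """Hypothetical argument set for tokenize_descriptions."""
    parser = argparse.ArgumentParser(description="Tokenize a transaction description")
    parser.add_argument("input_string", help="the description to tokenize")
    parser.add_argument("--top-only", action="store_true",
                        help="print only the top result")
    parser.add_argument("--append-to", metavar="FILE",
                        help="append the top result to FILE instead of printing everything")
    return parser.parse_args(argv)

args = parse_args(["SOME DESCRIPTION", "--top-only"])
print(args.top_only)  # True
```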

2014-01-03 Brainstorming

  1. Should we consider "feature engineering" or "feature learning" to be the correct path? UNKNOWN
  2. What algorithms are available for "feature learning"? https://en.wikipedia.org/wiki/Feature_learning
  3. What algorithms are available for "feature engineering"? UNKNOWN
  4. What is our timeline for finishing our basic toolset and moving on to:
    a. building a machine learning algorithm (OPEN)
  5. How much time would be spent on research? (ESTIMATED)
  6. How much time would be spent on development? (ESTIMATED)
  7. Are we hoping to build up the system incrementally? YES
  8. How can text be encoded into numerical values? (OPEN)
  9. Can we conduct a git exercise to allow independent feature development that will easily merge? What are the steps involved? More importantly, when are we going to do them? (COMPLETED)

new_bulk_loader branch

Will allow us to efficiently index 25 million records into ElasticSearch.
There will be special consideration for geo coordinate mappings to ensure that we can support search regions bounded by polygons.

Label transaction data

Prerequisite #60

  • Label transaction data (Matt, Andy)
    • At least 10,000 descriptions, to satisfy a 95% confidence interval with a 1% margin of error
    • Labelled as physical or not-physical
    • Labelled with the unique id of the matching Factual merchant
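The 10,000 figure can be sanity-checked with Cochran's sample-size formula; a minimal sketch, assuming the worst-case proportion p = 0.5:

```python
import math

def sample_size(confidence_z=1.96, margin=0.01, p=0.5):
    """Cochran's formula: labels needed to estimate a proportion
    within `margin` at the confidence level given by z."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

n = sample_size()  # 95% confidence, 1% margin, worst-case p
print(n)  # 9604, which rounds up to the 10,000 labels cited above
```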

new factual.com description_consumer and description_producer

  • Requirements
    • Needs to deal with factual.com as it was loaded by the new_bulk_loader module.
    • Will ultimately replace our existing description_consumer and description_producer modules.
    • Will deal with composite features in a generic way involving loading composite definitions from a configuration file.

We need to recruit multiple ES nodes for our tokenizer

  1. Each DescriptionConsumer thread should be assigned its own ES node for all queries.
  2. A list containing at least one ES node must be provided within a configuration file.
  3. If an ES node is not provided, throw an error and abort.
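The three requirements above could look something like the following sketch; the configuration layout and function name are assumptions:

```python
import itertools

def assign_nodes(config, thread_count):
    """Assign one ES node to each DescriptionConsumer thread,
    cycling through the configured list; abort if none is provided."""
    nodes = config.get("elasticsearch", {}).get("nodes", [])
    if not nodes:
        raise SystemExit("Fatal: no Elasticsearch nodes listed in the configuration file")
    cycle = itertools.cycle(nodes)
    return [next(cycle) for _ in range(thread_count)]

assignments = assign_nodes(
    {"elasticsearch": {"nodes": ["es1:9200", "es2:9200"]}}, thread_count=4)
print(assignments)  # ['es1:9200', 'es2:9200', 'es1:9200', 'es2:9200']
```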

Need more verbose error messages on Params load.

I don't know if verbose is the right word. Whatever.

Point is, if loading anything in the config file fails, it just throws a "can't find file" error. This led to some confusion and should be fixed.

"write_output" function fails on bad configuration file.

Please validate the JSON config file before processing begins.

Here are logs demonstrating how processing runs to completion before throwing an Exception, because the "output.file" key was not found in the JSON config file.

STEPS TO REPRODUCE:

  1. Run __init__.py using config/default.json

EXPECTED RESULT:

  1. An informative error message stating that the configuration file must have an "output.file" key and value; the program should then immediately halt without doing any actual work.

ACTUAL RESULT:

2014-01-22 17:21:49,808 - thread 0 - INFO - Log initialized.
Input String  CHECKCARD 0126 ORIGINAL GRAVITY PUBLIC SAN JOSE CA 24690293027080080270199
2014-01-22 17:21:49,808 - thread 0 - INFO - {
    "concurrency": 10,
    "input": {
        "encoding": "utf-8",
        "filename": "../data/short_bank_transaction_descriptions.csv"
    },
    "logging": {
        "console": true,
        "formatter": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        "level": "info",
        "path": "../logs/logs.log"
    },
    "output": {
        "filepath": "../data/longTailLabeled.csv",
        "results": {
            "fields": [
                "BUSINESSSTANDARDNAME",
                "HOUSE",
                "PREDIR",
                "STREET",
                "STRTYPE",
                "CITYNAME",
                "STATE",
                "ZIP",
                "pin.location"
            ],
            "size": 10
        }
    }
}
2014-01-22 17:21:49,816 - thread 1 - INFO - Log initialized.
[threads 1 through 9 each log the same configuration block as thread 0; identical repeats omitted]
Input String  ARCO PAYPOINT 01/22 #000490226 PURCHASE 470 RALSTON AVE BELMONT CA
2014-01-22 17:21:52,967 - thread 1 - INFO - TOKENS ARE: ['ARCO', 'PAYPOINT', '0122', '#0004', '9', '0226', 'PURCHASE', '470', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:52,973 - thread 1 - INFO - Unigrams are:
    ['ARCO', 'PAYPOINT', '0122', '#0004', '9', '0226', 'PURCHASE', '470', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:52,974 - thread 1 - INFO - Unigrams matched to ElasticSearch:
    ['PURCHASE', 'RALSTON', 'BELMONT', '#0004', '0226', '0122', 'ARCO', 'AVE', '470', 'CA']
2014-01-22 17:21:52,974 - thread 1 - INFO - Of these:
2014-01-22 17:21:52,974 - thread 1 - INFO -     2 stop words:      ['PAYPOINT', 'PURCHASE']
2014-01-22 17:21:52,980 - thread 1 - INFO -     0 phone_numbers:   []
2014-01-22 17:21:52,985 - thread 1 - INFO -     5 numeric words:   ['0122', '#0004', '9', '0226', '470']
2014-01-22 17:21:52,991 - thread 1 - INFO -     5 unigrams: ['ARCO', 'RALSTON', 'AVE', 'BELMONT', 'CA']
2014-01-22 17:21:53,062 - thread 1 - INFO -     1 addresses: ['470 RALSTON AVE BELMONT CA']
2014-01-22 17:21:53,063 - thread 1 - INFO - Search components are:
2014-01-22 17:21:53,068 - thread 1 - INFO -     Unigrams: 'ARCO RALSTON AVE BELMONT CA'
2014-01-22 17:21:53,069 - thread 1 - INFO -     Matching 'Address': '470 RALSTON AVE BELMONT CA'
2014-01-22 17:21:53,069 - thread 1 - INFO - {"size": 10, "fields": ["BUSINESSSTANDARDNAME", "HOUSE", "PREDIR", "STREET", "STRTYPE", "CITYNAME", "STATE", "ZIP", "pin.location"], "from": 0, "query": {"bool": {"minimum_number_should_match": 1, "should": [{"query_string": {"fields": ["_all^1", "BUSINESSSTANDARDNAME^2"], "query": "ARCO RALSTON AVE BELMONT CA", "boost": 1}}, {"match": {"composite.address^3": {"query": "470 RALSTON AVE BELMONT CA", "boost": 10, "type": "phrase"}}}]}}}
2014-01-22 17:21:53,208 - thread 1 - INFO - This system required 33 individual searches.
2014-01-22 17:21:53,215 - thread 1 - INFO - Z-Score delta: [0.731]
2014-01-22 17:21:53,215 - thread 1 - INFO - Top Score Quality: Low-grade
[0.113] Ralston Florist 932 Ralston Ave Belmont CA 94002 {'lat': '37.519724', 'lon': '-122.276999'}
[0.1] Ralston Florist 936 Ralston Ave Belmont CA 94002 {'lat': '37.519706', 'lon': '-122.27702'}
[0.082] Ralston Village Cleaners 980 Ralston Ave Belmont CA 94002 {'lat': '37.519505', 'lon': '-122.277254'}
[0.081] Ralston Elementary School 2675 Ralston Ave Belmont CA 94002 {'lat': '37.511269', 'lon': '-122.31189'}
[0.074] Belmont Gardens 1100 Ralston Ave Belmont CA 94002 {'lat': '37.517536', 'lon': '-122.278458'}
[0.07] Belmont Optique 877 Ralston Ave Belmont CA 94002 {'lat': '37.51979', 'lon': '-122.276451'}
[0.062] Whipple Arco 504 Whipple Ave Redwood City CA 94063 {'lat': '37.494598', 'lon': '-122.235054'}
[0.061] Universe of Colors of Belmont LLC 887 Ralston Ave Belmont CA 94002 {'lat': '37.519745', 'lon': '-122.276505'}
[0.058] Telegraph Arco 6407 Telegraph Ave Oakland CA 94609 {'lat': '37.850433', 'lon': '-122.260834'}
[0.058] Johns Arco 286 S Livermore Ave Livermore CA 94550 {'lat': '37.681221', 'lon': '-121.766518'}
2014-01-22 17:22:06,945 - thread 0 - INFO - TOKENS ARE: ['CHECKCARD', '0126', 'ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO - Unigrams are:
    ['CHECKCARD', '0126', 'ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO - Unigrams matched to ElasticSearch:
    ['ORIGINAL', 'GRAVITY', 'PUBLIC', '27080', '7019', '6902', 'JOSE', '0126', '0802', 'SAN', '930', 'CA', '24']
2014-01-22 17:22:06,945 - thread 0 - INFO - Of these:
2014-01-22 17:22:06,945 - thread 0 - INFO -     1 stop words:      ['CHECKCARD']
2014-01-22 17:22:06,945 - thread 0 - INFO -     0 phone_numbers:   []
2014-01-22 17:22:06,945 - thread 0 - INFO -     8 numeric words:   ['0126', '24', '6902', '930', '27080', '0802', '7019', '9']
2014-01-22 17:22:06,945 - thread 0 - INFO -     6 unigrams: ['ORIGINAL', 'GRAVITY', 'PUBLIC', 'SAN', 'JOSE', 'CA']
2014-01-22 17:22:07,294 - thread 0 - INFO -     0 addresses: []
2014-01-22 17:22:07,295 - thread 0 - INFO - Search components are:
2014-01-22 17:22:07,295 - thread 0 - INFO -     Unigrams: 'ORIGINAL GRAVITY PUBLIC SAN JOSE CA'
2014-01-22 17:22:07,295 - thread 0 - INFO - {"size": 10, "fields": ["BUSINESSSTANDARDNAME", "HOUSE", "PREDIR", "STREET", "STRTYPE", "CITYNAME", "STATE", "ZIP", "pin.location"], "from": 0, "query": {"bool": {"minimum_number_should_match": 1, "should": [{"query_string": {"fields": ["_all^1", "BUSINESSSTANDARDNAME^2"], "query": "ORIGINAL GRAVITY PUBLIC SAN JOSE CA", "boost": 1}}]}}}
2014-01-22 17:22:07,402 - thread 0 - INFO - This system required 238 individual searches.
2014-01-22 17:22:07,403 - thread 0 - INFO - Z-Score delta: [3.182]
2014-01-22 17:22:07,403 - thread 0 - INFO - Top Score Quality: High-grade
[6.698] Original Gravity Public House 66 S 1st St San Jose CA 95113 {'lat': '37.335018', 'lon': '-121.889503'}
[1.895] Gravity Mobile 466 8th St San Francisco CA 94103 {'lat': '37.772671', 'lon': '-122.407845'}
[1.816] Leadership Public Schools-San Jose 1881 Cunningham Ave San Jose CA 95122 {'lat': '37.330371', 'lon': '-121.828946'}
[1.789] Gravity Media 2030 Union St San Francisco CA 94123 {'lat': '37.79757', 'lon': '-122.432796'}
[1.788] Gravity People 147 Natoma St San Francisco CA 94105 {'lat': '37.786098', 'lon': '-122.399918'}
[1.779] City of San Jose Public Works 801 N 1st St San Jose CA 95110 {'lat': '37.350967', 'lon': '-121.903526'}
[1.657] Original Buddhism Society In America 1879 Lundy Ave San Jose CA 95131 {'lat': '37.392794', 'lon': '-121.890645'}
[1.527] Original Joes 1704 Union St San Francisco CA 94123 {'lat': '37.798243', 'lon': '-122.427518'}
[1.482] The Original Pancake House 1366 S De Anza Blvd San Jose CA 95129 {'lat': '37.298752', 'lon': '-122.031654'}
[1.475] The Original Pancake House 2306 Almaden Rd San Jose CA 95125 {'lat': '37.292016', 'lon': '-121.880092'}
Traceback (most recent call last):
  File "/Users/jkey/git/longtail/src/__init__.py", line 9, in <module>
    tokenize_descriptions.start()
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 102, in start
    tokenize(params, desc_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 58, in tokenize
    write_output(params, result_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 66, in write_output
    file_name = params["output"]["file"]["path"] or '../data/longtailLabeled.csv'
KeyError: 'file'
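The fix amounts to validating the parameter dictionary before any threads are spawned. A minimal sketch, with key paths assumed from the config shown in the logs above:

```python
def validate_params(params):
    """Fail fast if required config keys are missing, before any work begins."""
    required = ["input.filename", "logging.path", "output.file.path"]
    for path in required:
        node = params
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                raise KeyError("Configuration file is missing required key: " + path)
            node = node[key]

# The config from the logs uses "output.filepath", not "output.file.path",
# so validation catches the mismatch up front instead of after 238 searches.
bad = {"input": {"filename": "x.csv"}, "logging": {"path": "l.log"},
       "output": {"filepath": "y.csv"}}
try:
    validate_params(bad)
except KeyError as err:
    print(err)  # reports the missing "output.file.path" key
```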

Integrate Binary Classifier into project

Binary classifier should be integrated such that it removes non-physical transactions before forwarding them through the merchant labeler.

Should be configurable from config, such that non-physical transactions can be output for processing by another system.

Should probably be a sub-package, such that its parts and pieces are isolated from the rest of the project.
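The filtering stage could be as simple as the sketch below; `classify_fn` and the sink are hypothetical names standing in for the real classifier and the configured output for the other system:

```python
def filter_physical(transactions, classify_fn, non_physical_sink):
    """Route transactions before the merchant labeler: physical ones
    continue down the pipeline, non-physical ones go to the sink."""
    physical = []
    for txn in transactions:
        (physical if classify_fn(txn) else non_physical_sink).append(txn)
    return physical

sink = []
kept = filter_physical(["ARCO STATION BELMONT CA", "NETFLIX.COM"],
                       lambda t: "ARCO" in t, sink)  # toy classifier
print(kept, sink)  # ['ARCO STATION BELMONT CA'] ['NETFLIX.COM']
```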

Improved Accuracy Output

Taken from issue #53
Improved Accuracy Output (Matt)

Not sure what "improved" means. Add something to this issue to ensure that we know what needs to be done and why.

Matt Sevrens, estimates for stories for week starting 2014-01-27

"Feature discovery" requires some thought.

  1. Python3 research - How exactly do imports among modules work?
    35 sp
  2. Python3 research - How exactly should we structure unit tests?
    5 sp
  3. Binary Classifier -
    21 sp
  4. Decision boundary for recall -
    35 sp
    a. Research z-scores (wikipedia, scipy and numpy docs) 21 sp
    b. Writing source code 13 sp
  5. Parameter Vector
    21 sp
    a. Research numpy, especially arrays 13 sp
    b. Writing source code 8 sp
  6. Location estimator, depends on decision boundary
    90 sp
    a. ElasticSearch research - How to use the Geo API? 13 sp
    b. Writing source code 50 sp

Function to get requested fieldnames by PERSISTENTRECORDID

We should easily be able to provide a list of PERSISTENTRECORDIDs and a list of fieldnames, and get back all of the requested fields. This should most likely go into tools, as it seems like a generally reusable and useful function.
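The function could build a terms query like the one below; this is only a sketch, and the query shape mirrors the request bodies in the logs above rather than any existing module:

```python
import json

def build_fields_query(record_ids, fieldnames, size=None):
    """Build an Elasticsearch query body that fetches `fieldnames`
    for every record whose PERSISTENTRECORDID is in `record_ids`."""
    return {
        "size": size or len(record_ids),
        "fields": fieldnames,
        "query": {"terms": {"PERSISTENTRECORDID": record_ids}},
    }

body = build_fields_query(["123", "456"], ["BUSINESSSTANDARDNAME", "ZIP"])
print(json.dumps(body))
```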

"write_output_to_file" function fails when "csv" is provided as the format

Please validate the JSON config file before processing begins.

Here are logs demonstrating how processing runs to completion before throwing an Exception, because a result contained a field ("PREDIR") that is not listed in the configured output fieldnames.

STEPS TO REPRODUCE:

  1. Run __init__.py using config/single_output_csv.json

EXPECTED RESULT:

  1. An output file containing comma-separated values.
  2. No exceptions.

ACTUAL RESULT:

Input String  "CHECKCARD 0120 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0221 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0321 24HOUR FITNESS USA,INC 800-432-6348 CA..."
Input String  "CHECKCARD 0420 24HOUR FITNESS USA,INC 800-432-6348 CA..."
...
[1.375] Atria Sunnyvale 175 E 41683957
[0.893] Tlr 25 41524214
[1.375] Atria Sunnyvale 175 E 41683957
[1.375] Atria Sunnyvale 175 E 41683957
[0.846] Monroe The 1870 9165730
Traceback (most recent call last):
  File "/Users/jkey/git/longtail/src/__init__.py", line 9, in <module>
    tokenize_descriptions.start()
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 101, in start
    tokenize(params, desc_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 60, in tokenize
    write_output_to_file(params, result_queue)
  File "/Users/jkey/git/longtail/src/tokenize_descriptions.py", line 84, in write_output_to_file
    dict_w.writerows(output_list)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 158, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 149, in _dict_to_list
    + ", ".join(wrong_fields))
ValueError: dict contains fields not in fieldnames: PREDIR
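One way to harden "write_output_to_file" is csv.DictWriter's extrasaction='ignore' option, which drops keys missing from fieldnames instead of raising mid-run (validating the configured fieldnames up front would also work). A minimal sketch with made-up row data:

```python
import csv
import io

# A result row carries PREDIR, but the config only asked for two fields.
rows = [{"BUSINESSSTANDARDNAME": "Atria Sunnyvale", "PREDIR": "E", "ZIP": "94087"}]
fieldnames = ["BUSINESSSTANDARDNAME", "ZIP"]

buf = io.StringIO()
# extrasaction='ignore' silently discards keys absent from fieldnames,
# avoiding the ValueError shown in the traceback above.
writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```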

Set up a JSON-formatted configuration file as the only parameter

To handle the large variety of ways we might use our "tokenize_description" library, we should abandon the idea of supplying a variety of command-line arguments in favor of referring to a single configuration file of parameters.

For instance:

{
    "input" : {
        "type" : "local",
        "path" : "../short_bank_transaction_descriptions.csv"
    },
    "logging" : {
        "level" : "INFO",
        "path" : "../logs.log",
        "format" : "%(levelname)s:%(message)s"
    },
    "output" : {
        "type" : "local",
        "path" : "../short_bank_transaction_descriptions_results.csv",
        "formatter" : "name_of_python_function_to_format_output"
    }
}

ElasticSearch 404 issue on the following description strings.

This can be caused by multiple issues, such as a bad character or an empty query string.

Two current transactions causing this:
S AND S TIRE A 02/21 #000484150 PURCHASE 597 S MURPHY AVE SUNNYVALE CA
S AND S TIRE A 12/23 #000607706 PURCHASE S AND S TIRE AN SUNNYVALE CA

In these particular cases it's due to an empty query string. Will have to debug why that occurs as well.
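A guard along these lines would avoid sending the empty query in the first place; the stop-word list here is illustrative only:

```python
STOP_WORDS = frozenset({"AND", "OR", "PURCHASE"})  # illustrative list

def safe_query_string(tokens, stop_words=STOP_WORDS):
    """Drop reserved/stop words and return None when nothing remains,
    so the caller can skip the search instead of sending an empty
    query that ElasticSearch answers with a 404."""
    kept = [t for t in tokens if t.upper() not in stop_words]
    return " ".join(kept) or None

print(safe_query_string(["AND", "OR"]))              # None: skip the search
print(safe_query_string(["S", "AND", "S", "TIRE"]))  # 'S S TIRE'
```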

jkey - week of 2014-01-27

Forced Rank for new features:
#15 Node discovery based upon a single node from within a configuration file 6b788dd
#36 Add n-grams (n >=2) back to the final bool query 2a8a4e3
#37 description_consumer should take advantage of a shared dict of results 25166a2
#35 Reserved words "AND" and "OR" fail when submitted by themselves to __search_index d802ce8

Location boost for ambiguous transactions

Ambiguous transactions should be boosted by unambiguous surrounding transactions. We could do this by:

Having an input file that contains: USER_ID, DESCRIPTION, DATE

The file (sorted by date) would be run against our tokenize_descriptions file, and only results with a score over threshold_score would be labeled. The file would then be run again, but with a distance parameter boost relative to the transactions with the highest score. This could be repeated if it still provides sufficient advantages on each successive run.

Very important to think about in regards to parallelization as transactions don't occur in isolation. This would also reduce the number of searches necessary to get the correct merchant.
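The scheme can be sketched as a two-pass loop; the scoring function and the flat boost are stand-ins for the real search score and distance-based boost:

```python
def two_pass_label(transactions, score_fn, threshold, boost=0.2):
    """Pass 1 keeps only labels whose score clears `threshold`;
    pass 2 re-scores the remainder with a boost when an adjacent
    (date-sorted) transaction was confidently labelled."""
    scores = [score_fn(t) for t in transactions]
    labelled = [s >= threshold for s in scores]
    for i, ok in enumerate(labelled):
        if ok:
            continue
        near = (i > 0 and labelled[i - 1]) or \
               (i + 1 < len(labelled) and labelled[i + 1])
        if near and scores[i] + boost >= threshold:
            labelled[i] = True
    return labelled

# The middle transaction is ambiguous on its own, but sits between
# two confident neighbours, so the boost tips it over the threshold.
print(two_pass_label(["a", "b", "c"],
                     {"a": 0.9, "b": 0.75, "c": 0.95}.get,
                     threshold=0.8))  # [True, True, True]
```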

Process each description string in its own thread within "tokenize_description.py"

There are no dependencies among the description strings in our file, so there is no reason to process them sequentially. Let's add support for:

  1. Creating a synchronized queue of description strings from the input file
  2. Spawning a configurable number of consumer threads
  3. Ensuring consumer threads tokenize the strings, logging data to the same file
  4. Having each thread write its output to another synchronized queue
  5. Piping results to the output file in a non-interleaved way
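The steps above can be sketched with the standard library's queue and threading modules; tokenization is stubbed out with a simple split:

```python
import queue
import threading

def process_descriptions(descriptions, worker_count=3):
    """A synchronized input queue feeds a pool of consumer threads;
    each thread pushes its result onto a second queue that is drained
    once, keeping the output non-interleaved."""
    desc_queue, result_queue = queue.Queue(), queue.Queue()
    for desc in descriptions:
        desc_queue.put(desc)

    def consumer():
        while True:
            try:
                desc = desc_queue.get_nowait()
            except queue.Empty:
                return
            result_queue.put(desc.split())  # stand-in for real tokenizing

    threads = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [result_queue.get() for _ in range(result_queue.qsize())]

results = process_descriptions(["ARCO PAYPOINT", "CHECKCARD 0126"])
print(sorted(results))
```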

ElasticSearch 404 Errors halt execution

When elasticsearch returns with a 404, it breaks execution of tokenize_descriptions.

It should instead log the error, perhaps into a special file, and continue to the next description in the queue.
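A minimal sketch of the desired catch-and-continue behaviour; the exception type and search function are stand-ins for whatever the ES client actually raises:

```python
def search_with_skip(query, search_fn, error_log):
    """Log a failing description and move on, rather than letting
    one bad query halt the whole run."""
    try:
        return search_fn(query)
    except Exception as err:  # e.g. a 404 from ElasticSearch
        error_log.append((query, str(err)))
        return None

def fake_search(q):
    """Toy search that fails on an empty query, like the 404 case."""
    if not q:
        raise RuntimeError("404: empty query string")
    return {"hits": 1}

errors = []
print(search_with_skip("ARCO BELMONT CA", fake_search, errors))  # {'hits': 1}
print(search_with_skip("", fake_search, errors))                 # None
print(len(errors))  # 1: the failure was logged, not fatal
```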

Elasticsearch 404 halts execution of tokenize_descriptions.py

When elasticsearch returns with a 404, it breaks execution of tokenize_descriptions.

This can be caused by multiple issues, such as a bad character or an empty query string.

Two current transactions causing this:
S AND S TIRE A 02/21 #000484150 PURCHASE 597 S MURPHY AVE SUNNYVALE CA
S AND S TIRE A 12/23 #000607706 PURCHASE S AND S TIRE AN SUNNYVALE CA

In these particular cases it's due to an empty query string. Will have to debug why that occurs as well.
