formasaurus's People

Contributors

kmike, lopuhin, lucywang000, mehaase

formasaurus's Issues

Deprecation of sklearn.externals.joblib

$ formasaurus init
/usr/local/lib/python3.8/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)

scikit-learn 0.23 was released three days ago: https://github.com/scikit-learn/scikit-learn/releases
New installations of formasaurus no longer work with the default pip installation.

Either the code is adapted (importing directly from joblib), or requirements.txt is updated to:

# 0.18 is needed for GroupKFold; 0.23 removes the deprecated sklearn.externals.joblib
scikit-learn >= 0.18, <=0.22.2.post1

Current workaround is:

pip3 install formasaurus
pip3 install -U scikit-learn==0.22.2.post1
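
For the first option (adapting the code), one possible shape is an import that prefers the standalone joblib package and only falls back to the copy vendored in older scikit-learn releases. This is a minimal sketch, not the project's actual fix; `import_first_available` and `JOBLIB_CANDIDATES` are hypothetical names introduced here for illustration.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError("none of %r could be imported" % (module_names,))


# Prefer the standalone package; fall back to the copy bundled with
# scikit-learn releases that still ship sklearn.externals.joblib.
JOBLIB_CANDIDATES = ["joblib", "sklearn.externals.joblib"]
```

A module that needs joblib could then use `joblib = import_first_available(JOBLIB_CANDIDATES)` in place of the deprecated import.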

fix evaluation.py

  • it should split data using LabelKFold;
  • it should also check field classifier;
  • classification reports should use cross_val_predict, not just train/test split
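
To illustrate the first bullet, here is a rough pure-Python sketch of group-aware splitting, where every sample that shares a group label (for example, pages from the same website) lands in the same fold. It is an assumption-laden stand-in for what LabelKFold provides, not scikit-learn's implementation; `group_k_fold` is a hypothetical name.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs; no group appears on both sides."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    if n_splits < 2 or n_splits > len(members):
        raise ValueError("n_splits must be between 2 and the number of groups")

    folds = [[] for _ in range(n_splits)]
    # Place the largest groups first, each into the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(folds, key=len)
        smallest.extend(members[group])

    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, sorted(test_idx)
```

Splitting this way keeps near-duplicate forms from one site out of both the training and the test side, which a plain random split does not guarantee.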

Formasaurus init fails with scikit-learn 1.2.0

It seems that scikit-learn v1.2.0, released in December 2022, breaks the formasaurus init command. See the following output:

Training form type detector on 1423 example(s)...
#9 4.760 Traceback (most recent call last):
#9 4.760   File "/usr/local/bin/formasaurus", line 33, in <module>
#9 4.761     sys.exit(load_entry_point('formasaurus==0.9.0', 'console_scripts', 'formasaurus')())
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/__main__.py", line 72, in main
#9 4.761     formasaurus.FormFieldClassifier.load()
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 101, in load
#9 4.761     ex = cls.trained_on(DEFAULT_DATA_PATH)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 119, in trained_on
#9 4.761     ex.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 131, in train
#9 4.761     self.form_classifier.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 266, in train
#9 4.761     self.model = formtype_model.train(
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/formtype_model.py", line 128, in train
#9 4.762     return model.fit(X, y)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 402, in fit
#9 4.762     Xt = self._fit(X, y, **fit_params_steps)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 360, in _fit
#9 4.762     X, fitted_transformer = fit_transform_one_cached(
#9 4.762   File "/usr/local/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
#9 4.762     return self.func(*args, **kwargs)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.762     res = transformer.fit_transform(X, y, **fit_params)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
#9 4.763     data_to_wrap = f(self, X, *args, **kwargs)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1193, in fit_transform
#9 4.763     results = self._parallel_func(X, y, fit_params, _fit_transform_one)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1215, in _parallel_func
#9 4.763     return Parallel(n_jobs=self.n_jobs)(
#9 4.763   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 1088, in __call__
#9 4.764     while self.dispatch_one_batch(iterator):
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
#9 4.764     self._dispatch(tasks)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
#9 4.764     job = self._backend.apply_async(batch, callback=cb)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
#9 4.764     result = ImmediateResult(func)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
#9 4.764     self.results = batch()
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/fixes.py", line 117, in __call__
#9 4.765     return self.function(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.765     res = transformer.fit_transform(X, y, **fit_params)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 446, in fit_transform
#9 4.766     return last_step.fit_transform(Xt, y, **fit_params_last_step)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 2121, in fit_transform
#9 4.766     X = super().fit_transform(raw_documents)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1358, in fit_transform
#9 4.768     self._validate_params()
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 570, in _validate_params
#9 4.768     validate_parameter_constraints(
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
#9 4.768     raise InvalidParameterError(
#9 4.768 sklearn.utils._param_validation.InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None. Got {'and', 'of', 'or'} instead.

This command works fine with the previous version of scikit-learn, v1.1.3.
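
Going by the error message, stop_words must be 'english', a list, or None, while a set was passed. A minimal sketch of a conversion that would satisfy that check is below; `as_stop_word_list` is a hypothetical helper, and where formasaurus actually builds its TfidfVectorizer is not shown in the traceback.

```python
def as_stop_word_list(stop_words):
    """Return stop_words in a form the parameter check accepts.

    None and strings such as 'english' pass through unchanged; any other
    iterable (set, frozenset, tuple, list) becomes a sorted list.
    """
    if stop_words is None or isinstance(stop_words, str):
        return stop_words
    return sorted(stop_words)
```

The vectorizer could then be created with something like `TfidfVectorizer(stop_words=as_stop_word_list({'and', 'of', 'or'}))`.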

formasaurus check-data fails locally

formasaurus check-data fails on my machine (but not on Travis):

Checking:  16%|####                      | 149/954 [00:00<00:05, 137.76 files/s]
Invalid form count for entry 'html/ddl-warez.in-0.html': expected 0, got 1
Invalid number of form field annotations for entry 'html/ddl-warez.in-0.html'
Checking:  59%|###############4          | 567/954 [00:03<00:01, 222.76 files/s]
Invalid form count for entry 'html/cafephim.vn-1.html': expected 0, got 2
Invalid number of form field annotations for entry 'html/cafephim.vn-1.html'
Checking:  77%|####################      | 736/954 [00:05<00:01, 209.64 files/s]
Invalid form count for entry 'html/postr.hu-2.html': expected 0, got 6
Invalid number of form field annotations for entry 'html/postr.hu-2.html'
Checking:  78%|####################1     | 740/954 [00:05<00:01, 207.79 files/s]
Invalid form count for entry 'html/www.elandroidelibre.com-0.html': expected 0, got 1
Invalid number of form field annotations for entry 'html/www.elandroidelibre.com-0.html'
Checking:  99%|#########################6| 942/954 [00:06<00:00, 252.47 files/s]
Invalid form count for entry 'html/postr.hu-1.html': expected 0, got 6
Invalid number of form field annotations for entry 'html/postr.hu-1.html'
Checking: 100%|##########################| 954/954 [00:06<00:00, 132.87 files/s]
Status: 10 error(s) found

Captcha forms

Right now Formasaurus does not seem to support captcha forms: forms with a single text input and a captcha image, designed to block crawlers. It has no such form type, and when applied to such forms (I have tried only two so far) it does not detect the captcha field correctly.
Are such forms in scope for the library? What would be a reasonable number of such forms to include in the training dataset?

error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -Icrfsuite/include/ -Icrfsuite/lib/cqdb/include -Iliblbfgs/include -Ipycrfsuite -IC:\Users\tonia\AppData\Local\Programs\Python\Python310\include -IC:\Users\tonia\AppData\Local\Programs\Python\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tppycrfsuite/_pycrfsuite.cpp /Fobuild\temp.win-amd64-3.10\Release\pycrfsuite/_pycrfsuite.obj
_pycrfsuite.cpp
pycrfsuite/_pycrfsuite.cpp(13546): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
pycrfsuite/_pycrfsuite.cpp(13562): warning C4996: '_PyUnicode_get_wstr_length': deprecated in 3.3
pycrfsuite/_pycrfsuite.cpp(13906): warning C4996: 'PyUnicode_FromUnicode': deprecated in 3.3
pycrfsuite/_pycrfsuite.cpp(17069): error C3861: '_PyGen_Send': identifier not found
pycrfsuite/_pycrfsuite.cpp(17074): error C3861: '_PyGen_Send': identifier not found
pycrfsuite/_pycrfsuite.cpp(17158): error C3861: '_PyGen_Send': identifier not found
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> python-crfsuite

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

pip install Formasaurus[with-deps]==0.7.1 does not install deps

pip install Formasaurus[with-deps]==0.7.1 does not install the dependencies with the latest pip and setuptools, on either Python 2.7 or Python 3.4:

$ pip install formasaurus[with-deps]
Collecting formasaurus[with_deps]
/home/ubuntu/autologin/venv/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:315: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/home/ubuntu/autologin/venv/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading formasaurus-0.7.1-py2.py3-none-any.whl (13.7MB)
    100% |████████████████████████████████| 13.7MB 96kB/s
Collecting requests (from formasaurus[with_deps])
  Using cached requests-2.9.1-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): six in ./venv/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from formasaurus[with_deps])
Collecting tldextract (from formasaurus[with_deps])
  Using cached tldextract-1.7.5.tar.gz
Collecting tqdm>=2.0 (from formasaurus[with_deps])
  Using cached tqdm-3.8.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): w3lib>=1.13.0 in ./venv/lib/python2.7/site-packages/w3lib-1.13.0-py2.7.egg (from formasaurus[with_deps])
Collecting docopt (from formasaurus[with_deps])
  Using cached docopt-0.6.2.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in ./venv/lib/python2.7/site-packages (from tldextract->formasaurus[with_deps])
Requirement already satisfied (use --upgrade to upgrade): idna in ./venv/lib/python2.7/site-packages/idna-2.0-py2.7.egg (from tldextract->formasaurus[with_deps])
Installing collected packages: requests, tldextract, tqdm, docopt, formasaurus
  Running setup.py install for tldextract ... done
  Running setup.py install for docopt ... done
Successfully installed docopt-0.6.2 formasaurus-0.7.1 requests-2.9.1 tldextract-1.7.5 tqdm-3.8.0

pip install Formasaurus[annotation] works, and pip install .[with-deps] also works from source. I also tried an underscore instead of a dash (with_deps), but that still does not work for me. I could be doing something wrong, or this could be an issue with setuptools. Perhaps renaming this option to with_deps or deps would help.

Deprecation warnings

Some deprecation warnings when using sklearn 0.18.2:

/usr/local/lib/python3.6/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

/usr/local/lib/python3.6/dist-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.

Button elements are ignored

While trying autologin against some of the sites in the training data, I found that some sites have changed since the data was collected and no longer work.
Formasaurus ignores 'button' elements, which in these cases are used for submission instead of an input element and are required to log in.

Examples:

The main problem seems to be that buttons are Elements or HtmlElements. Unlike InputElements, these don't have .name or .type attributes, so they are filtered out by if getattr(f, 'name', None). If I modify the code so that it doesn't filter them, it blows up later on where it assumes .name and .type attributes exist.

As a hacky workaround/proof I modified html.load_html to convert all button elements to input elements:

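    # Requires `import lxml.html` and `from lxml import etree` at module level.
    parsed = lxml.html.fromstring(html, base_url=base_url, parser=parser)
    # Replace every <button> with an <input> that carries the same attributes.
    for node in parsed.xpath('//button'):
        new_node = etree.Element("input")
        for name, value in node.items():
            new_node.set(name, value)
        node.getparent().replace(node, new_node)
    return parsed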

After which autologin worked on the above sites.
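
A less invasive direction might be to read name and type through a helper that falls back to the element's markup attributes when the Python attribute is missing. The sketch below is only an illustration of that idea: `field_attr` is a hypothetical helper, and a stdlib ElementTree element stands in for the lxml button.

```python
import xml.etree.ElementTree as ET


def field_attr(elem, attr, default=None):
    """Read attr as a Python attribute if present, else as a markup attribute."""
    value = getattr(elem, attr, None)
    if value is None:
        value = elem.get(attr, default)
    return value


# Stand-in for a parsed <button>; like the buttons described above, a stdlib
# Element exposes no .name or .type property, only .get().
button = ET.fromstring('<button name="login" type="submit">Log in</button>')
print(field_attr(button, "name"), field_attr(button, "type"))
```

Passing a default such as "submit" for a missing type would mirror how browsers treat a button without a type attribute.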

Thanks,

Tom

CAPTCHA image

The classification for a CAPTCHA input is very good. It would be helpful if the library could also classify which <img> element is associated with the CAPTCHA.

Max_iter error when running Formasaurus init

I get this output when running formasaurus init for the first time. I am not sure whether it has any negative effect on predictions afterwards, but figured I'd bring it up.

On a fresh setup of formasaurus, running init produces this:

Loading training data...
Loading: 954 files [00:05, 176.17 files/s] 

Training form type detector on 1426 example(s)...
/root/ericsn0/n0/src/n0/modules/testvenv/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Training field type detector...
Training on 1363 forms
Using precise form types
Extracting features

Python 3.7.7

Virtual env packages:

Package          Version
---------------- ---------
certifi          2020.6.20
chardet          3.0.4
docopt           0.6.2
formasaurus      0.9.0
idna             2.10
joblib           0.16.0
lxml             4.5.2
numpy            1.19.1
pip              20.1.1
python-crfsuite  0.9.7
requests         2.24.0
requests-file    1.5.1
scikit-learn     0.23.1
scipy            1.5.2
setuptools       49.2.0
six              1.15.0
sklearn          0.0
sklearn-crfsuite 0.3.6
tabulate         0.8.7
threadpoolctl    2.1.0
tldextract       2.2.2
tqdm             4.48.0
urllib3          1.25.10
w3lib            1.22.0
wheel            0.34.2
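
To check programmatically whether a training run emitted this warning, rather than reading the console, the standard warnings module can record what was raised during a call. This is a small sketch under that assumption; `call_and_collect_warnings` is a hypothetical helper name.

```python
import warnings


def call_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, warnings raised during the call).

    Each recorded warning is reported as a (category name, message) pair.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown once in this process.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [(w.category.__name__, str(w.message)) for w in caught]
```

Wrapping the training call this way and looking for a "ConvergenceWarning" entry in the returned list would show whether lbfgs hit the iteration limit, for instance before and after raising max_iter.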

PyPI package is broken and out of date

The version at https://pypi.org/project/formasaurus/ is still 0.8.1, which is broken on install for the reasons described in #26.

Version 0.9.0 was released two years ago but never made it to PyPI, presumably because of a broken Travis test. That test arguably shouldn't exist in its current form anymore, because every Python version below 3.7 reached end of life months or years ago.

This can be worked around by installing from GitHub (see #26), but that should not be a permanent solution.
