pmundkur / libcrm114
C library version of CRM114, and a Python binding
Home Page: http://crm114.sourceforge.net/wiki/doku.php?id=download
From http://crm114.sourceforge.net/wiki/doku.php?id=download:

CRM114 C-callable Library

This is the callable library version of CRM114. It has most of the same classifiers as the standalone language, with some significant improvements: one alpha tester says they saw a 10x speedup in their application. This version is LGPLed (Library GPL), so you can link it with your own code, whether open-source or proprietary. You still need TRE (on Fedora, “yum install tre-devel”).

Note that with improvements come costs: libcrm114 classifiers are NOT compatible with standalone CRM114 class files. This is necessary because libcrm114 classifiers can work even on systems that don't have filesystems, like embedded processors.

The code is now pretty stable and the API is solidly entrenched by use in several real products, so the API is unlikely to change in unpleasant ways.

Advantages of libcrm114:

- It's much faster; everything is in-memory.
- You can call everything directly from ANSI C.
- Because everything is in memory, it's good for embedded systems where you don't _have_ a unix-style file system to talk to.
- No arcane language to learn, it's all just ANSI C.
- You can export classifiers in an ASCII “CSV-like” format, so trained classifiers are 32/64-bit portable and cross-platform Linux/Mac/Windows portable (the internal binary classifier format is still tied to a particular architecture, but that's never exported any more).

Disadvantages of libcrm114:

- Not all classifiers are currently supported (in particular, Neural Net, Correlator, OSBF, and Winnow are NOT yet supported).
- There's no crazy language, so you need to get your data into memory on your own.
- You still need TRE.
- You do pay a (not horrible) startup cost loading a classifier from an ASCII CSV-like file, but since you can then reuse the classifier for as many documents as you want, in the long term this cost is amortized down to zero and you get a significant speedup.
Dependencies

Debian/Ubuntu: libtre5, libtre-dev

Building

    $ make && cd python && python setup.py build
The Python binding should support reading and dumping the db in binary form.
As per doc/HOWTO.txt:
In either case, just write p_db->datablock_size bytes of data to the
disk when saving the db, and read the full saved file size into memory
to restore the db. For example:
    FILE *myfile;
    myfile = fopen ("/home/me/my_db_file.crm", "wb");
    /* exit must be called with a status; a bare "exit" does nothing */
    if (!myfile) { printf ("Couldn't open\n"); exit (1); }
    fwrite (p_db, 1, p_db->datablock_size, myfile);
    fclose (myfile);
and the corresponding code for reading the db back in.
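The HOWTO leaves the read side to the reader. A minimal sketch of both directions might look like the following. Note that `demo_block`, `save_block` and `load_block` are hypothetical names made up for this example; `demo_block` only stands in for the library's datablock by having a `datablock_size` field at the front, and none of this is libcrm114's own API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library's datablock: the example only
   relies on a size field followed by the payload bytes. */
typedef struct {
    size_t datablock_size;   /* total size of the block, in bytes */
    unsigned char data[];    /* payload */
} demo_block;

/* Write datablock_size bytes to path. Returns 0 on success, -1 on failure. */
int save_block(const char *path, const demo_block *p_db)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(p_db, 1, p_db->datablock_size, f);
    int close_failed = fclose(f);
    return (written == p_db->datablock_size && !close_failed) ? 0 : -1;
}

/* Read the full saved file into freshly allocated memory.
   Returns NULL on any error; the caller frees the result. */
demo_block *load_block(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* The file size is the block size, so measure the file first. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < (long)sizeof(demo_block)) { fclose(f); return NULL; }
    rewind(f);

    demo_block *p_db = malloc((size_t)size);
    if (!p_db) { fclose(f); return NULL; }

    size_t got = fread(p_db, 1, (size_t)size, f);
    fclose(f);

    /* Reject short reads and files whose size field disagrees. */
    if (got != (size_t)size || p_db->datablock_size != (size_t)size) {
        free(p_db);
        return NULL;
    }
    return p_db;
}
```

Calling `save_block("my_db_file.crm", p_db)` and later `load_block("my_db_file.crm")` on the same machine should give back a byte-identical block. A binding could presumably expose the same pair of operations.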
Binary read and dump seems worthwhile because it should load faster and take less space than the ASCII format:
Quoting again the doc/HOWTO.txt:
// *** OPTIONAL *** Here's how to read and write the datablocks as
// ASCII text files. This is NOT recommended for storage (it's ~5x bigger
// than the actual datablock, and takes longer to read in as well),
// but rather as a way to debug datablocks, or move a db portably
// between 32- and 64-bit machines and between Linux and Windows.
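To illustrate why a text dump travels between machines where a raw memory dump may not: raw bytes depend on the byte order and type sizes of the machine that wrote them, while decimal text does not. The sketch below writes and re-reads a few integers as text. The function names are invented for the example, and the one-value-per-line layout is an assumption for illustration, not libcrm114's actual text format.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Write count values as decimal text, one per line.
   Returns 0 on success, -1 on a write error. */
int write_values_text(FILE *f, const uint32_t *values, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (fprintf(f, "%" PRIu32 "\n", values[i]) < 0)
            return -1;
    return 0;
}

/* Parse up to count decimal values back in.
   Returns the number of values actually read. */
size_t read_values_text(FILE *f, uint32_t *values, size_t count)
{
    size_t i = 0;
    while (i < count && fscanf(f, "%" SCNu32, &values[i]) == 1)
        i++;
    return i;
}
```

The cost is the one the HOWTO names: every value has to be formatted and parsed, and the text takes more bytes than the binary form.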
I changed this line in the Makefile:
$(CC) -shared -Wl,-soname,$(LIB_NAME) -o lib/$(LIB_NAME) $(LIBOBJS)
to be:
$(CC) -shared -Wl,-dylib_install_name,$(LIB_NAME) -o lib/$(LIB_NAME) $(LIBOBJS)
While running 'make' it gives:
```
cc -c -g -Iinclude -fpic -std=c99 -pedantic -Wall -Wextra -Wpointer-arith -Wstrict-prototypes -Wno-sign-compare -Wno-overlength-strings -o lib/crm114_regex_tre.o lib/crm114_regex_tre.c
lib/crm114_regex_tre.c: In function ‘crm114__regfree’:
lib/crm114_regex_tre.c:61: warning: ISO C forbids conversion of function pointer to object pointer type
cc -shared -Wl,-dylib_install_name,libcrm114.so.1 -o lib/libcrm114.so.1 lib/crm114_base.o lib/crm114_markov.o lib/crm114_markov_microgroom.o lib/crm114_bit_entropy.o lib/crm114_hyperspace.o lib/crm114_svm.o lib/crm114_svm_lib_fncts.o lib/crm114_svm_quad_prog.o lib/crm114_fast_substring_compression.o lib/crm114_pca.o lib/crm114_pca_lib_fncts.o lib/crm114_matrix.o lib/crm114_matrix_util.o lib/crm114_datalib.o lib/crm114_vector_tokenize.o lib/crm114_strnhash.o lib/crm114_util.o lib/crm114_regex_tre.o
Undefined symbols for architecture x86_64:
  "_tre_regncomp", referenced from:
      _crm114__regncomp in crm114_regex_tre.o
  "_tre_regnexec", referenced from:
      _crm114__regnexec in crm114_regex_tre.o
  "_tre_regerror", referenced from:
      _crm114__regerror in crm114_regex_tre.o
  "_tre_regfree", referenced from:
      _crm114__regfree in crm114_regex_tre.o
  "_tre_free", referenced from:
      _crm114__regfree in crm114_regex_tre.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make: *** [lib/libcrm114.so] Error 1
```
I suppose something needs to be added to the Makefile so that the TRE shared library gets linked in?