Comments (18)

wannesm avatar wannesm commented on June 13, 2024 2

I would still not advise it ;-) Setting k also avoids doing DTW computations, and that's the expensive part. A DTW computation is stopped early once it's clear that its result cannot be better than the k-th best distance found up to that point. Setting k to None is the same as running all comparisons, which could be done faster using parallelization etc. (like for the distance_matrix computation).
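A minimal sketch of the early-stopping idea described above, in pure Python. This is not the dtaidistance implementation; the names dtw_distance and kbest are hypothetical, and the squared-difference cost is an assumption made for illustration. Once k candidates are known, the k-th best distance is passed as max_dist, and a DTW computation is abandoned as soon as a whole row of its cost matrix exceeds that bound.

```python
import heapq
import math


def dtw_distance(s1, s2, max_dist=math.inf):
    """Plain DTW with squared differences; returns inf if abandoned early."""
    limit = max_dist * max_dist  # compare in squared space
    prev = [math.inf] * (len(s2) + 1)
    prev[0] = 0.0
    for a in s1:
        curr = [math.inf] * (len(s2) + 1)
        for j, b in enumerate(s2, start=1):
            cost = (a - b) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        # Costs never decrease along a path, so if every cell in this row
        # is already above the bound, the final distance must be too.
        if min(curr) > limit:
            return math.inf
        prev = curr
    return math.sqrt(prev[-1])


def kbest(query, series_list, k=None):
    """Return sorted (distance, index) pairs; k=None compares everything."""
    if k is None:
        return sorted((dtw_distance(query, s), i) for i, s in enumerate(series_list))
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = []  # max-heap via negated distances, holds at most k entries
    for i, s in enumerate(series_list):
        # Only bound the computation once k candidates have been collected.
        max_dist = -heap[0][0] if len(heap) == k else math.inf
        d = dtw_distance(query, s, max_dist)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in heap)
```

With k set, series that are clearly worse than the current k-th best are dropped after a few rows instead of being computed in full, which is where the saving comes from.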

from dtaidistance.

wannesm avatar wannesm commented on June 13, 2024 1

The datatype is used to compile the C code. So there is currently no easy way to change the datatype. But this is a variable in our setup, so we can easily change it if you recompile yourself. I pushed a branch where all files are generated for the float C-type (thus np.float32): https://github.com/wannesm/dtaidistance/tree/feature/float

(warning: I did not test this extensively, just some quick tests with np.array([...], dtype=np.float32) )

ps: In principle we could generate multiple versions and combine them but this requires quite some disentanglement and extra code. And you are the first one to need this outside of our lab ;-)

wannesm avatar wannesm commented on June 13, 2024 1

There are three manual steps:

  1. Change double to float on lines 22-23 of dtaidistance/jinja/generate.py:
    seq_t = "double"
    seq_tpy = "double"
  2. Change double to float on line 19 of dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
  3. Regenerate the source code files with:
    cd dtaidistance/jinja/
    make

After this, compiling via the setup.py file (or pip) should result in a version that expects float instead of double.

wannesm avatar wannesm commented on June 13, 2024 1

The makefile could get confused about which files to update after changing branches (and thus didn't change some of the files which triggers the error you got). It now regenerates all files by default. The master and feature/float branches are updated.

tommedema avatar tommedema commented on June 13, 2024 1

@wannesm yes, it's usually low, but then at the very end of returning the k best results it spikes to several gigabytes. This doesn't happen when k != None. I'm okay with setting k so this is not a big problem for me.

I pulled in master, changed the doubles to floats in the config files, compiled, and installed using pip, and it works now. 🎉 I wonder if the pip install method is more reliable than the setup script:

$ git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance
$ sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h
$ cd dtaidistance/jinja && make && cd ../..
$ pip install .

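Note that on Linux (GNU sed, as opposed to the BSD sed that ships with macOS) the -i flag takes no separate empty argument, so the command would be: sed -i 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h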
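For anyone who wants to avoid the sed -i difference between macOS and Linux, here is a small Python sketch that does the same blanket substitution. The helper name replace_in_files is hypothetical, and like the sed command it replaces every occurrence of the word, so it assumes the files contain no other "double" that should stay.

```python
from pathlib import Path


def replace_in_files(paths, old="double", new="float"):
    """Replace every occurrence of `old` with `new` in each file, in place.

    Returns a dict mapping each path to the number of replacements made.
    """
    counts = {}
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        counts[str(path)] = text.count(old)
        path.write_text(text.replace(old, new), encoding="utf-8")
    return counts
```

Run from the repository root, it would be called with the same two files as the sed command: replace_in_files(["dtaidistance/jinja/generate.py", "dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h"]).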

I'll keep k at an integer (not None) given your recommendation in the source code comments (very helpful, thank you).

FYI- the RAM usage is the same when K is at None, even after recent changes (not a problem for me but figured I'd share in case this is interesting to you):

[Screenshot of RAM usage, 2022-10-10]

When setting K it stays at around 60-100MB rather than 1.9GB.

It's great to see peak RAM usage reduced from 150GB to just 13GB. Means I could expand my data set further..

It is very interesting indeed! It's my first time working with this many cores and RAM. I have more things to work on especially around relaxing my window requirements (looking into PSI next). Really appreciate the swift responses. It's pretty admirable to see your C code, I am limited to high level languages currently (never got into C but have curiosity towards it).

wannesm avatar wannesm commented on June 13, 2024 1

I didn't anticipate using large values for k. The increase of the memory at the end is the creation of an array to hold all results. But this array creates an expensive object for every match, which is unnecessary. I removed this. Querying results is now lazy and only creates an object if you ask for details on a match (and can garbage collect if you don't need it anymore).
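A rough sketch of what "lazy" means here, assuming nothing about the actual dtaidistance classes: Match and LazyMatches are hypothetical names. The container keeps only the raw distances and builds a match object at the moment one is requested, so unused matches never exist and used ones can be garbage collected.

```python
class Match:
    """Details for one match; only built when someone asks for it."""

    def __init__(self, idx, distance):
        self.idx = idx
        self.distance = distance


class LazyMatches:
    """Sequence-like view that creates Match objects on demand."""

    def __init__(self, distances):
        self._distances = distances  # the only per-match storage

    def __len__(self):
        return len(self._distances)

    def __getitem__(self, idx):
        idx = range(len(self))[idx]  # normalises negatives, raises IndexError
        return Match(idx, self._distances[idx])

    def __iter__(self):
        for idx in range(len(self)):
            yield self[idx]
```

Compared with building a list of one object per match up front, memory stays proportional to the distances alone, however large k is.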

wannesm avatar wannesm commented on June 13, 2024 1

You are not running the make command in the jinja directory in the first version? This is not automatically triggered by the setup.py file because we didn't want to make jinja a required dependency. We add the generated files directly to the git repository.

The exact error you see is also because I further improved the independence of the exact datatype used throughout the code. Almost all occurrences of 'double' have been removed from the C and Cython code and all types are now covered by a typedef of seq_t. This is done in dtaidistance_globals.pxd (which mirrors dd_globals from the C code). If you don't run the makefile, this is still set to 'double' for the Cython code (alternatively, add dtaidistance_globals.pxd to your sed command).

Ps: To make it easier, I added makefile rules to easily switch between types. You can now drop the sed command and simply do:

cd dtaidistance/jinja && make float && cd ../..

tommedema avatar tommedema commented on June 13, 2024

@wannesm this is great, I will try it out asap. Do you know how I can change the config myself and recompile so that I can still benefit from future updates to dtaidistance?

tommedema avatar tommedema commented on June 13, 2024

Excellent. I'll try both your generated version and my own compiled version and report back! Appreciate the awesome support as always.

tommedema avatar tommedema commented on June 13, 2024

@wannesm ok so I tried your first solution:

pip install --force-reinstall --upgrade --no-deps --no-build-isolation --no-binary dtaidistance git+https://github.com/wannesm/dtaidistance.git@feature/float#egg=dtaidistance

on a m1 macbook pro, and it gave me this error:

https://gist.github.com/tommedema/806ea6b9deb1c10391dad30cf476c5d2

I then tried the second solution (compiling from source after changing the setup files) but got this error:

https://gist.github.com/tommedema/409522a058aa17fa50e944335ccf0663

Again, I really appreciate the help.

tommedema avatar tommedema commented on June 13, 2024

@wannesm sounds promising, though re-installing from git (the updated feature/float branch) still gave me this error, even after restarting my jupyter server and kernel:

[8:apply]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <string>:1

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_41941/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
    587 def kbest_matches_fast(self, k=1):
    588     self.dists_options['use_c'] = True
--> 589     return self.kbest_matches(k=k)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
    591 def kbest_matches(self, k=1):
--> 592     self.align(k=k)
    593     if k is None:
    594         return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
    563 max_dist = np.inf
    564 for idx, series in enumerate(self.s):
--> 565     dist = dtw.distance(self.query, series, **self.dists_options)
    566     if k is not None:
    567         if len(h) < k:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
    221         logger.warning("C-library not available, using the Python version")
    222     else:
--> 223         return distance_fast(s1, s2, window,
    224                              max_dist=max_dist,
    225                              max_step=max_step,
    226                              max_length_diff=max_length_diff,
    227                              penalty=penalty,
    228                              psi=psi,
    229                              use_pruning=use_pruning,
    230                              only_ub=only_ub)
    231 r, c = len(s1), len(s2)
    232 if max_length_diff is not None and abs(r - c) > max_length_diff:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
    338 s2 = util_numpy.verify_np_array(s2)
    339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
    341                     window=window,
    342                     max_dist=max_dist,
    343                     max_step=max_step,
    344                     max_length_diff=max_length_diff,
    345                     penalty=penalty,
    346                     psi=psi,
    347                     use_pruning=use_pruning,
    348                     only_ub=only_ub)
    349 return d

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()

ValueError: Buffer dtype mismatch, expected 'float' but got 'double'

And, interestingly, building from source (after updating to your new master changes) gave me the opposite error:

[7:apply]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <string>:1

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:117, in parallelTrainTestByQueryIndex(queryIndex)

File /var/folders/11/829w0h653zn4z81jgf0mv48r0000gp/T/ipykernel_42677/2997127518.py:31, in getQuerySeriesResults(q, series, additions, minMatchCount, maxMatchCount)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:589, in SubsequenceSearch.kbest_matches_fast(self, k)
    587 def kbest_matches_fast(self, k=1):
    588     self.dists_options['use_c'] = True
--> 589     return self.kbest_matches(k=k)

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:592, in SubsequenceSearch.kbest_matches(self, k)
    591 def kbest_matches(self, k=1):
--> 592     self.align(k=k)
    593     if k is None:
    594         return [SSMatch(best_idx, self) for best_idx in range(len(self.distances))]

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/subsequence/dtw.py:565, in SubsequenceSearch.align(self, k)
    563 max_dist = np.inf
    564 for idx, series in enumerate(self.s):
--> 565     dist = dtw.distance(self.query, series, **self.dists_options)
    566     if k is not None:
    567         if len(h) < k:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:223, in distance(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_c, use_pruning, only_ub)
    221         logger.warning("C-library not available, using the Python version")
    222     else:
--> 223         return distance_fast(s1, s2, window,
    224                              max_dist=max_dist,
    225                              max_step=max_step,
    226                              max_length_diff=max_length_diff,
    227                              penalty=penalty,
    228                              psi=psi,
    229                              use_pruning=use_pruning,
    230                              only_ub=only_ub)
    231 r, c = len(s1), len(s2)
    232 if max_length_diff is not None and abs(r - c) > max_length_diff:

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw.py:340, in distance_fast(s1, s2, window, max_dist, max_step, max_length_diff, penalty, psi, use_pruning, only_ub)
    338 s2 = util_numpy.verify_np_array(s2)
    339 # Move data to C library
--> 340 d = dtw_cc.distance(s1, s2,
    341                     window=window,
    342                     max_dist=max_dist,
    343                     max_step=max_step,
    344                     max_length_diff=max_length_diff,
    345                     penalty=penalty,
    346                     psi=psi,
    347                     use_pruning=use_pruning,
    348                     only_ub=only_ub)
    349 return d

File ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx:289, in dtaidistance.dtw_cc.distance()

ValueError: Buffer dtype mismatch, expected 'double' but got 'float'

I checked that all my input queries and series are of type np.float32. I also ran pip uninstall dtaidistance before doing any of the above.

wannesm avatar wannesm commented on June 13, 2024

That's indeed surprising. I cannot reproduce it myself when starting from a clean installation from the feature/float branch and recompiling. For reference, this is what I do, starting from a new virtualenv and git clone (to have access to the tests; I included one with subsequence search):

$ git clone https://github.com/wannesm/dtaidistance.git
$ cd dtaidistance
$ git checkout feature/float
$ pip install .
$ cd .. # to not use local repo as package
$ pip install pytest
$ python dtaidistance/tests/test_float.py

The first one is especially surprising. Maybe there is a transformation I forgot about. But it's surprising it's not triggered in the second version.

The second one feels like a wrong version of the compiled library is picked up. It would help to print the mentioned line to see whether the toolbox is simply picking up the wrong compiled library. The .pyx file should mention floats.

$ sed -n '289p' ~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/dtaidistance-2.3.10-py3.10-macosx-12.6-arm64.egg/dtaidistance/dtw_cc.pyx
def distance(float[:] s1, float[:] s2, **kwargs):
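To see which installed copy Python would actually import (and so which file to inspect), something like the following may help. It uses only the standard library; the helper name module_origin is hypothetical.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when the parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Prints None when dtaidistance is not installed in this environment.
    print(module_origin("dtaidistance.dtw_cc"))
```

If the printed path points into a different site-packages directory or an old .egg, a stale build is being picked up.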

tommedema avatar tommedema commented on June 13, 2024

@wannesm I think you're right that it must be the data I'm passing in, as it does work when I do something like:

subsequence_search(q.astype(np.float32), series.astype(np.float32), dists_options={'use_c': True, 'max_dist': maxDistance})

It's odd given that I am sure I'm passing in float32, but clearly this must be on my side. Apologies for the back and forth and appreciate the help. I will figure out where the data is somehow not float32 next :)
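One way to hunt for the stray non-float32 input is to look at the item format an object exposes through the buffer protocol, since that is what a typed-memoryview signature such as float[:] is matched against. The sketch below uses only the standard library (array.array stands in for the real data), and the helper names are hypothetical. NumPy arrays expose the same protocol, so the check should apply to them as well, with float32 reporting "f" and float64 reporting "d".

```python
from array import array


def buffer_format(obj):
    """Return the struct-style item format an object exposes as a buffer."""
    with memoryview(obj) as view:
        return view.format


def is_c_float(obj):
    """True if the buffer holds C floats (32-bit), i.e. format 'f'."""
    return buffer_format(obj) == "f"


singles = array("f", [1.0, 2.0, 3.0])  # C float, like np.float32
doubles = array("d", [1.0, 2.0, 3.0])  # C double, like np.float64
print(buffer_format(singles), is_c_float(singles))
print(buffer_format(doubles), is_c_float(doubles))
```

Calling is_c_float on each array right before it is handed to the search would show which one was silently converted back to double.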

tommedema avatar tommedema commented on June 13, 2024

@wannesm btw - somehow after changing to float32 each core is still using 1.87GB at peak, where the actual data only seems to be about 60MB. Do you think this could be because of the matrix calculations subsequence_search is doing? I wonder if that expands the RAM usage.

Update: I think this is resolved by setting k = 200 (for example) instead of None when calling kbest_matches_fast

RAM usage has now gone from 150GB to 13GB with 80 cores. Thank you :)

wannesm avatar wannesm commented on June 13, 2024

Do you see memory usage go up while the subsequence search is running? This shouldn't happen too much.

Even without a window, the memory usage for the DTW computation is 2*len(array) (with window it is 2*2*window).
The subsequencesearch is keeping track of all distances for all segments instead of the top-k which is suboptimal. That is a memory usage of 8*nb_segments/1024**2 MiB, and thus still surprising to be higher than the full data. In any case, I removed this array in the master branch as it is not strictly required (it now stores only k distances+indices).
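The two formulas above can be put into a few lines to get a feel for the numbers. This is only a back-of-the-envelope sketch; the function names are hypothetical and the 8 bytes per value corresponds to the default double type.

```python
def distances_array_mib(nb_segments, bytes_per_value=8):
    """Size in MiB of an array holding one distance per segment."""
    return bytes_per_value * nb_segments / 1024**2


def dtw_working_values(series_len, window=None):
    """Number of values kept in memory during one DTW computation."""
    if window is None:
        return 2 * series_len
    return 2 * 2 * window


# Hypothetical example: one million segments of length 1000.
print(round(distances_array_mib(1_000_000), 2), "MiB for all distances")
print(dtw_working_values(1000), "values without a window")
print(dtw_working_values(1000, window=10), "values with window=10")
```

Even for a million segments the distance array is only a few MiB, which is why a per-core peak of almost 2 GB cannot be explained by this array alone.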

ps: It's interesting to see cases like yours which push the limits with large datasets and many cores.

tommedema avatar tommedema commented on June 13, 2024

> I didn't anticipate using large values for k. The increase of the memory at the end is the creation of an array to hold all results. But this array creates an expensive object for every match, which is unnecessary. I removed this. Querying results is now lazy and only creates an object if you ask for details on a match (and can garbage collect if you don't need it anymore).

That's perfect. Sounds like I can use k = None again? I'll give it a try!

tommedema avatar tommedema commented on June 13, 2024

@wannesm I wanted to bring something to your attention regarding this. When reinstalling on another macbook (intel), I discovered the following behavior when using the setup.py script:

pip uninstall dtaidistance

export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"

git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance

sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h

python3 setup.py build_ext --inplace

python3 setup.py install

Results in these double / float errors:

https://gist.github.com/tommedema/d2ab88161e6732a5029e0bfe5ce02485

When I then try dtw.try_import_c I get:

https://gist.github.com/tommedema/3c062c752df68567e0b15dcb9009ba92

However, when installing as before, it works fine:

pip uninstall dtaidistance

export LDFLAGS="-L/usr/local/opt/libomp/lib"
export CPPFLAGS="-I/usr/local/opt/libomp/include"

git clone https://github.com/wannesm/dtaidistance.git && cd dtaidistance

sed -i '' 's/double/float/g' dtaidistance/jinja/generate.py dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/dd_globals.h

cd dtaidistance/jinja && make && cd ../..
pip install .

Here I see no error messages, and the C version can be correctly imported.

Any idea what's causing this difference?

tommedema avatar tommedema commented on June 13, 2024

That worked! My bad for not running make with the setup script. The new make float is sweet, thanks for that :)

Do you recommend using the setup script or just using pip install .? I'm not sure I understand the difference.

BTW- I did see this warning (repeated about 200x) when running make:

/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/_stdio.h:93:16: warning: pointer is missing a nullability type specifier (_Nonnull, _Nullable, or _Null_unspecified) [-Wnullability-completeness]
        unsigned char   *_base;

Full log at https://gist.github.com/tommedema/938ce28659a3257513fc0d819b0dd355

Not sure if it matters, since everything seems to work just fine.
