Comments (10)

frankfliu commented on August 26, 2024

This seems to be a PyTorch bug: pytorch/pytorch#121293

from djl.

ebremer commented on August 26, 2024

My POM:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ebremer</groupId>
    <artifactId>DJL</artifactId>
    <version>0.0.0</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
        <exec.mainClass>com.ebremer.djl.DJL</exec.mainClass>
    </properties>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>ai.djl</groupId>
                <artifactId>bom</artifactId>
                <version>0.27.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>api</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>basicdataset</artifactId>
            <type>jar</type>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.1.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.1.1-0.27.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.30</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>

frankfliu commented on August 26, 2024

Can you try:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.27.0</version>
            <scope>runtime</scope>
        </dependency>

ebremer commented on August 26, 2024

Failed to load PyTorch native library

[main] INFO ai.djl.util.Platform - Found matching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu117/1.13.1/pytorch-native-cu117-1.13.1-win-x86_64.jar!/native/lib/pytorch.properties
Exception in thread "main" ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:90)
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41)
	at ai.djl.engine.Engine.getEngine(Engine.java:190)
	at ai.djl.engine.Engine.getInstance(Engine.java:145)
	at ai.djl.Model.newInstance(Model.java:72)
	at ai.djl.Model.newInstance(Model.java:61)
	at com.examples.Models.getModel(Models.java:43)
	at com.examples.Training.main(Training.java:33)
Caused by: java.lang.UnsatisfiedLinkError: C:\Users\erich\.djl.ai\pytorch\1.13.1-20221220-cu117-win-x86_64\torch_cuda_cpp.dll: Can't find dependent libraries
	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2418)
	at java.base/java.lang.Runtime.load0(Runtime.java:852)
	at java.base/java.lang.System.load(System.java:2025)
	at ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(LibUtils.java:379)
	at ai.djl.pytorch.jni.LibUtils.loadLibTorch(LibUtils.java:195)
	at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:82)
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53)
	... 7 more
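
The "Can't find dependent libraries" message does not name the missing DLL. One way to narrow it down, once a candidate file name is known, is to check which directories on PATH actually contain it. The following is a minimal sketch under that assumption; `FindOnPath` and `findOnPath` are hypothetical names and are not part of DJL, and the DLL to look for has to be supplied by the reader.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of DJL.
final class FindOnPath {

    /** Returns the full path of every match for fileName, in PATH search order. */
    static List<Path> findOnPath(String fileName, String pathValue) {
        List<Path> hits = new ArrayList<>();
        if (fileName == null || pathValue == null) {
            return hits;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    hits.add(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries (for example, entries with stray quotes).
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java FindOnPath.java <library-file-name>");
            return;
        }
        List<Path> hits = findOnPath(args[0], System.getenv("PATH"));
        if (hits.isEmpty()) {
            System.out.println(args[0] + " was not found in any PATH directory");
        } else {
            hits.forEach(p -> System.out.println("found: " + p));
        }
    }
}
```

If the same file name shows up in more than one directory, the first hit is the one a PATH-based lookup would normally pick up, which is worth checking when several CUDA versions are installed side by side.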

ebremer commented on August 26, 2024

I saw the new release and tried the dependencies below, but I get the same error unless I run with BATCH_SIZE = 1, in which case training starts fine.

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.2.2</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.3.0-0.28.0</version>
            <scope>runtime</scope>
        </dependency>
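
Note that the snippet above pairs pytorch-native-cu121 2.2.2 with pytorch-jni 2.3.0-0.28.0, while the other snippets in this thread use the same PyTorch version in both artifacts (for example 1.13.1 and 1.13.1-0.27.0). Whether that mismatch played any role in the error is not established here. A small sketch of a consistency check, assuming the pytorch-jni version has the form `<PyTorch version>-<DJL version>` as those coordinates suggest; `PtVersionCheck` and its methods are hypothetical names, not a DJL API.

```java
// Hypothetical helper; assumes pytorch-jni versions look like "<PyTorch version>-<DJL version>".
final class PtVersionCheck {

    /** Returns the part of the pytorch-jni version before the first dash. */
    static String pytorchPart(String jniVersion) {
        int dash = jniVersion.indexOf('-');
        return dash < 0 ? jniVersion : jniVersion.substring(0, dash);
    }

    /** True when the native artifact version equals the PyTorch part of the JNI version. */
    static boolean matches(String nativeVersion, String jniVersion) {
        return nativeVersion.equals(pytorchPart(jniVersion));
    }

    public static void main(String[] args) {
        // Version pairs taken from the POM snippets in this thread.
        System.out.println(matches("2.2.2", "2.3.0-0.28.0"));
        System.out.println(matches("2.2.2", "2.2.2-0.28.0"));
    }
}
```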

mdxd44 commented on August 26, 2024

I got the same issue with ai.djl.pytorch:pytorch-native-cu121:2.3.0, ai.djl.pytorch:pytorch-jni:2.3.0-0.28.0 and CUDA 12.4.1.
Rolling back to ai.djl.pytorch:pytorch-native-cu117:1.13.1, ai.djl.pytorch:pytorch-jni:1.13.1-0.28.0 and CUDA 11.7.0 fixed the issue.

ebremer commented on August 26, 2024

@mdxd44 I tried your rollback and got this:

[main] INFO ai.djl.util.Platform - Found matching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu117/1.13.1/pytorch-native-cu117-1.13.1-win-x86_64.jar!/native/lib/pytorch.properties
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/asmjit.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/c10.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/c10_cuda.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/caffe2_nvrtc.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_adv_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_adv_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_cnn_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_cnn_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_ops_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_ops_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/fbgemm.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/libiomp5md.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/nvToolsExt64_1.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/nvrtc64_112_0.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cpu.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda_cpp.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda_cu.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/uv.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/zlibwapi.dll to cache ...
Exception in thread "main" ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:90)
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41)
	at ai.djl.engine.Engine.getEngine(Engine.java:190)
	at ai.djl.engine.Engine.getInstance(Engine.java:145)
	at ai.djl.Model.newInstance(Model.java:72)
	at ai.djl.Model.newInstance(Model.java:61)
	at com.examples.Models.getModel(Models.java:43)
	at com.examples.Training.main(Training.java:33)
Caused by: java.lang.UnsatisfiedLinkError: C:\Users\erich\.djl.ai\pytorch\1.13.1-20221220-cu117-win-x86_64\torch_cuda_cpp.dll: Can't find dependent libraries
	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2418)
	at java.base/java.lang.Runtime.load0(Runtime.java:852)
	at java.base/java.lang.System.load(System.java:2025)
	at ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(LibUtils.java:379)
	at ai.djl.pytorch.jni.LibUtils.loadLibTorch(LibUtils.java:195)
	at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:82)
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53)
	... 7 more
Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:328)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:316)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:174)
    at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:75)
    at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:162)
    at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:159)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:105)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:73)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:53)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:118)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:261)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:173)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:101)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:906)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:283)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:206)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke (Method.java:580)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:283)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:226)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:407)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:348)
------------------------------------------------------------------------
BUILD FAILURE
------------------------------------------------------------------------
Total time:  16.182 s
Finished at: 2024-05-20T08:43:46-04:00
------------------------------------------------------------------------
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:3.1.0:exec (default-cli) on project DJL: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.

For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

for this POM:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

mdxd44 commented on August 26, 2024

@ebremer you can try to use this tool to identify which dependency it can't find

ebremer commented on August 26, 2024

Some measure of success...
I removed all CUDA libraries I had on my system (there were several) and installed 11.7 only. Everything worked with:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

I removed 11.7 and installed 12.1 only. I updated the POM to:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.2.2</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.2.2-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

but it could not find the CUDA environment:

[main] WARN ai.djl.util.Platform - The bundled library: cu121-win-x86_64:2.2.2-20240505} doesn't match system: cu065-win-x86_64:2.2.2
[main] INFO ai.djl.util.Platform - Ignore mismatching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu121/2.2.2/pytorch-native-cu121-2.2.2-win-x86_64.jar!/native/lib/pytorch.properties
[main] WARN ai.djl.pytorch.jni.LibUtils - No matching cuda flavor for win-x86_64 found: cu065.
[main] INFO ai.djl.pytorch.engine.PtEngine - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 32
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 16
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Training on: cpu().
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Load PyTorch Engine Version 2.2.2 in 0.012 ms.

Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: 0.47, SoftmaxCrossEntropyLoss: 2.46
Training:      0% |=                                       | Accuracy: 0.47, SoftmaxCrossEntropyLoss: 2.46
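
The `cu065` in the warnings above reads like a CUDA 6.5 runtime rather than 12.1. A minimal sketch of decoding such flavor strings, assuming the format is "cu" followed by the major version and a single minor digit; that format is inferred from the values seen in this thread (cu117, cu121, cu065), not taken from the DJL source, and `CudaFlavor` is a hypothetical name.

```java
// Hypothetical decoder; the flavor format is an assumption inferred from this thread.
final class CudaFlavor {

    /** Turns a flavor such as "cu121" into a dotted version such as "12.1". */
    static String toVersion(String flavor) {
        if (flavor == null || !flavor.matches("cu\\d{2,3}")) {
            throw new IllegalArgumentException("unexpected flavor: " + flavor);
        }
        String digits = flavor.substring(2);
        // The last digit is treated as the minor version, the rest as the major version.
        int major = Integer.parseInt(digits.substring(0, digits.length() - 1));
        int minor = digits.charAt(digits.length() - 1) - '0';
        return major + "." + minor;
    }

    public static void main(String[] args) {
        for (String flavor : new String[] {"cu117", "cu121", "cu065"}) {
            System.out.println(flavor + " -> CUDA " + toVersion(flavor));
        }
    }
}
```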

ebremer commented on August 26, 2024

Success!
I debugged the DJL code and found out what was going wrong.
Tracing through ai.djl.util.cuda.loadLibrary(), I saw that the code picked up C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll because CUDA_PATH was not defined; the installer had defined CUDA_PATH_v12.1 instead. In addition, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin was not on the PATH.
I renamed CUDA_PATH_v12.1 to CUDA_PATH, since that is the environment variable looked up at line 241. Loading still failed at that point: line 253 reduces the path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudart64_12.dll to just the file name, without the directory, so the load at line 255 failed. Once I added C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin to the PATH variable, it was able to load CUDA 12.1, and GPU training finally ran.
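
The two conditions described above (CUDA_PATH being set, and its bin directory being on PATH) can be checked before starting DJL. This is a minimal sketch under the assumption that the runtime DLL lives in the `bin` directory under CUDA_PATH, as in the paths quoted in this comment; `CudaEnvCheck` and its methods are hypothetical names and do not reproduce DJL's own lookup logic.

```java
import java.io.File;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical pre-flight check; it does not reproduce DJL's own lookup logic.
final class CudaEnvCheck {

    /** Returns "<CUDA_PATH>/bin", or null when CUDA_PATH is not set. */
    static String cudaBinDir(Map<String, String> env) {
        String cudaPath = env.get("CUDA_PATH");
        if (cudaPath == null || cudaPath.isEmpty()) {
            return null;
        }
        return new File(cudaPath, "bin").getPath();
    }

    /** True when dir appears as an entry of the given PATH value. */
    static boolean isOnPath(String dir, String pathValue) {
        if (dir == null || pathValue == null) {
            return false;
        }
        File wanted = new File(dir);
        for (String entry : pathValue.split(Pattern.quote(File.pathSeparator))) {
            // File.equals compares path names, ignoring case on Windows.
            if (!entry.isEmpty() && new File(entry).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bin = cudaBinDir(System.getenv());
        if (bin == null) {
            System.out.println("CUDA_PATH is not set");
        } else {
            System.out.println("CUDA bin directory: " + bin);
            System.out.println("on PATH: " + isOnPath(bin, System.getenv("PATH")));
        }
    }
}
```

Taking the environment as a `Map` keeps the two checks easy to exercise with made-up values, without changing the real environment.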
