
Comments

deadeyegoodwin commented on May 22, 2024

We have just started work on implementing a shared-memory API (option C). Changes will start to come into master and we expect to have an initial minimal implementation in about 3 weeks. The API will allow input and output tensors to be passed to/from TRTIS via shared memory instead of over the network. It will be the responsibility of an outside "agent" to create and manage the lifetime of the shared-memory regions. TRTIS will provide APIs that allow that "agent" to register/unregister these shared-memory regions with TRTIS so they can then be used in inference requests.
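
A minimal, hypothetical sketch of what such an outside agent might do to create and later clean up a system shared-memory region, using standard POSIX calls (the region name and size are arbitrary, and the TRTIS register/unregister and inference calls are only indicated by comments):

    // Outside "agent" that owns the lifetime of a POSIX shared-memory region.
    // Link with -lrt on older glibc. Name and size are illustrative only.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      const char* name = "/trtis_input_region";
      const size_t byte_size = 16 * 1024 * 1024;

      // Create the region and set its size.
      int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
      if (fd == -1) { perror("shm_open"); return 1; }
      if (ftruncate(fd, byte_size) == -1) { perror("ftruncate"); return 1; }

      // Map it so the agent (e.g. a pre-processor) can write tensor bytes into it.
      void* base = mmap(nullptr, byte_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED) { perror("mmap"); return 1; }

      // ... fill `base` with input tensor data, register the region with
      // TRTIS via its shared-memory API, and issue inference requests ...

      // The agent is responsible for tearing the region down when done.
      munmap(base, byte_size);
      close(fd);
      shm_unlink(name);
      return 0;
    }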


deadeyegoodwin commented on May 22, 2024

By this week we should have shared memory support for input tensors with some minimal testing. Output tensor support will follow shortly after. Adding support to perf_client plus much more extensive testing is needed after that before we can declare system memory (CPU) sharing complete. That will likely take a couple of weeks. After that we will start on GPU shared memory.


deadeyegoodwin commented on May 22, 2024

The master branch now has the initial implementation for shared memory support for input tensors and some minimal testing.

Currently only the C++ client API supports shared memory (Python support is TBD, but you can always use gRPC to generate client code for many languages). The C++ API changes are here: 6d33c8c#diff-906ebe14e6f98b22609d12ac8433acc0

An example application is: https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/clients/c%2B%2B/simple_shm_client.cc. The L0_simple_shared_memory_example test performs some minimal testing using that example application.


deadeyegoodwin commented on May 22, 2024

In general I think your assessment is correct: I/O can be a performance limiter for some models, and a primary way to fix this in many cases is to co-locate the pre-processing with the inference. Here are some variations we have thought about and where we stand on current support:

  1. Pre-processing "service" running on same node as TensorRT Inference Server (TRTIS).
    a. Use GRPC (or HTTP) to communicate from pre-processor -> TRTIS. Since communication is now local it may no longer be a bottleneck...
    b. For even higher BW between pre-processor -> TRTIS, remove the GRPC/HTTP protocol overhead by implementing a custom/raw socket API. The internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. Another option here is a flatbuffer interface which we have also thought about but not done anything with as yet.
    c. Use shared memory as you suggest... this would likely require a custom TRTIS API to communicate the shared-memory reference, so it is similar to (b).
    d. For maximum bandwidth you could share GPU memory between the pre-processor and TRTIS and use that for communication. The pre-processor would leave the input tensors in GPU memory and just share the location (via CUDA IPC) with TRTIS. We want to add some functionality to TRTIS to support this but currently we have not (see the CUDA IPC sketch after this list).
  2. Avoid communication completely by implementing the pre-processor within TRTIS. Again, the internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. In general we are interested in generic pre-processor "add-ons" of this kind that we can incorporate into TRTIS as build-time options.
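
To make option (1d) concrete, here is a rough, hypothetical sketch of the CUDA IPC mechanism involved (this is not TRTIS code, and how the serialized handle gets from one process to the other is left out):

    // Rough sketch of option (1d): sharing a GPU buffer between a
    // pre-processor process and a consumer process via CUDA IPC.
    // Error checks are omitted and the function names are made up.
    #include <cuda_runtime.h>

    // Producer (pre-processor): allocate device memory, fill it with the
    // input tensor, and export an IPC handle identifying the allocation.
    cudaIpcMemHandle_t ExportInputBuffer(void** dev_ptr, size_t byte_size) {
      cudaMalloc(dev_ptr, byte_size);
      // ... launch pre-processing kernels that write into *dev_ptr ...
      cudaIpcMemHandle_t handle;
      cudaIpcGetMemHandle(&handle, *dev_ptr);
      return handle;  // send these bytes to the consumer process
    }

    // Consumer (hypothetically, the inference server): map the same
    // allocation into this process and read the tensor directly.
    void* ImportInputBuffer(cudaIpcMemHandle_t handle) {
      void* dev_ptr = nullptr;
      cudaIpcOpenMemHandle(&dev_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
      // ... run inference on dev_ptr, then cudaIpcCloseMemHandle(dev_ptr) ...
      return dev_ptr;
    }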

I would suggest that you start with (1a) and see how much benefit that gets you. We are generally interested in improving TRTIS in this area and would welcome your experience and feedback as you experiment. If you think you could contribute something generally useful we would be very open to working with you on it; just be sure to include us in your plans early on so we can make sure we are all on the same page.

As for your question #3: yes, for experimenting it is probably fastest to hack up the gRPC service to pass a reference instead of the actual data (but keep the rest of the request/response message the same). infer.cc is where the data (raw_input) is read out of the request message, so you would need to change that to read from shared memory instead.
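
As a purely illustrative sketch of that hack (the fields shm_name, shm_offset and byte_size are invented here and are not part of the real TRTIS protobuf), the modified handler could resolve the reference like this instead of copying raw_input:

    // Hypothetical server-side helper: given a shared-memory reference carried
    // in the (modified) request message, return a pointer to the input bytes
    // instead of reading them from raw_input.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <string>

    const void* InputFromSharedMemory(const std::string& shm_name,
                                      size_t shm_offset, size_t byte_size) {
      int fd = shm_open(shm_name.c_str(), O_RDONLY, 0666);
      if (fd == -1) return nullptr;
      void* base =
          mmap(nullptr, shm_offset + byte_size, PROT_READ, MAP_SHARED, fd, 0);
      close(fd);  // the mapping remains valid after the descriptor is closed
      if (base == MAP_FAILED) return nullptr;
      // Hand this pointer to the backend in place of the bytes that infer.cc
      // would normally copy out of raw_input.
      return static_cast<const char*>(base) + shm_offset;
    }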


philipp-schmidt commented on May 22, 2024

@deadeyegoodwin @CoderHam what are the chances the three shared memory branches will make it onto master this week? And will perf_client support a shared memory test out of the box?

I tried building the shared memory branches, but I'm not sure I'm getting the combination of server and clients from the different branches right. What would be the easiest way to get a little test going? Building the server on "hemantj-sharedMemory-server" and then using the simple shm client from "hemantj-sharedMemory-test"? For now I'm only interested in the performance gains and resulting throughput, so I'm basically fine with an unstable, buggy demo if it at least runs somehow. The changes in the code look great so far, thanks for the good work!


CoderHam commented on May 22, 2024

Works perfectly, thanks!

I had to add "--ipc=host" to docker (it has been mentioned somewhere in a related pull request) and use the -I flag to prevent the simple_shm_client from using output shared memory, if anyone else is trying.

That's right, the simple_shm_client is currently set up to use both input and output shared memory by default (-I for input only and -O for output only).

Yes, --ipc=host is necessary. I will remember to add this to the docs when output shared memory is also completed.


deadeyegoodwin commented on May 22, 2024

System shared memory is complete and available on master branch and 19.10. CUDA shared memory is in progress and will be available in 19.11. Closing.


seovchinnikov commented on May 22, 2024

Hey! Thank you.
I suggested the same idea here https://github.com/NVIDIA/dl-inference-server/issues/24#event-1980712680
so it's obviously a popular enhancement; it would be cool to have a basic implementation of point #3


seovchinnikov commented on May 22, 2024

Ok, I've implemented it for gRPC (only) in a very hacky way: https://github.com/seovchinnikov/tensorrt-inference-server/tree/file-api
But I've been to hell and back turning off all the sanity checks, because the server was not intended to take dynamic-sized input, so I hope for a better solution.
@deadeyegoodwin thanks for the very well-structured code, it was not very difficult to figure out what to tweak.


ryanolson commented on May 22, 2024

@mrjackbo - For my projects, I've done exactly what you are describing above. I've created a pre/post-processing service which uses sysv shared-memory to avoid serializing, moving, and deserializing raw tensors over a gRPC message.

There are two granularities of access control at which you can expose the shared-memory segments between processes:

  1. Node level - create shared-memory segments which are exposed to any process running on the node.
  2. Namespace level - created shared-memory segments are only accessible if the processes/containers share the same IPC namespace.

In Kubernetes, you can use a DaemonSet to create node-level IPC segments, or you can use multiple containers in a Pod; by default, all containers in a Pod share the same IPC namespace. Unfortunately, there is no API (that I am aware of) that allows different Pods to share the same IPC namespace. An example might look like:

      containers:
      - name: shared-memory
        image: my-shared-memory-service-image
        ports:
        - name: grpc
          containerPort: 50049
      - name: trtis
        image: my-customized-trtis-image
        command: ["wait-for-it.sh", "localhost:50049", "--timeout=0", "--", "/opt/tensorrtserver/bin/trtserver", "--model-store=/tmp/models"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - name: metrics
          containerPort: 8002

In this example, the shared-memory service receives incoming async gRPC requests, then uses the async gRPC client to forward them to TRTIS. You have to customize TRTIS's protobuf API definition so you can pass a segment_id and offset rather than a bytes object.

Note: the TRTIS service also needs to be customized to connect to the shared-memory service and handshake segment IDs.
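
As a rough, illustrative sketch of the receiving side of that handshake (the function name is made up; the SysV details follow the description above):

    // Attach the SysV segment identified by `segment_id` (as produced by the
    // shared-memory service via shmget) and return a read-only pointer to the
    // tensor bytes that start at `offset` inside the segment.
    #include <sys/shm.h>
    #include <cstddef>

    const void* TensorFromSegment(int segment_id, std::size_t offset) {
      void* base = shmat(segment_id, nullptr, SHM_RDONLY);
      if (base == reinterpret_cast<void*>(-1)) return nullptr;
      // The caller must eventually call shmdt(base) when the request completes.
      return static_cast<const char*>(base) + offset;
    }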

Using Docker, you can use the --ipc flag directly. You can create your shared-memory service container, then start TRTIS with --ipc=container:<shared_memory_service_container_name>. The Docker Compose file might look something like:

  sharedmemory:
    image: my-shared-memory-service-image
    ports:
    - 3333:50051
  tensorrt:
    depends_on:
    - sharedmemory
    image: my-customized-trtis-image
    ipc: container:inferencedemo_sharedmemory_1

I'll post an example of the shared-memory service I have used and update this thread when it's ready. The example is currently in an older project that is actively being moved to a new GitHub project.


pmcgraw-lucidyne commented on May 22, 2024

Hello, I am now tackling this, and before I get too far into the weeds I figured I would follow up here, given that the latest release (r19.04) supports custom operations at build time or startup. If I wanted to avoid having to write data to files before making a request to TRTIS, would I be looking at writing a custom operation, or at using the existing API somehow?


deadeyegoodwin commented on May 22, 2024

I assume when you say "custom operation" you mean "custom backend".

You could create a custom backend that expected the input to identify the shared memory handle, offset, size, etc. The custom backend would extract the data from the shared memory region identified by that handle into one or more output tensors. You could then ensemble this custom backend with your actual model. When you made a request, your input tensor would be just the shared memory handle, offset, size, etc. data expected by your custom backend. An example of using a custom backend with an ensemble will be included in the 19.05 release (coming later this week). Or you can find it now on master: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/client.html#ensemble-image-classification-example-application
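
As a rough illustration of what that "handle" input tensor could look like, here is one hypothetical encoding the custom backend and its clients might agree on (nothing about this layout is defined by TRTIS):

    // Hypothetical fixed-size record a client could send as the raw bytes of
    // the custom backend's input tensor. The backend would parse this, attach
    // the named shared-memory region, and produce the real tensor data as its
    // output. The layout is an example only.
    #include <cstdint>

    #pragma pack(push, 1)
    struct ShmTensorHandle {
      char          shm_name[64];  // e.g. a POSIX shm name such as "/preproc_out"
      std::uint64_t offset;        // byte offset of the tensor inside the region
      std::uint64_t byte_size;     // number of tensor bytes to read
    };
    #pragma pack(pop)

    static_assert(sizeof(ShmTensorHandle) == 64 + 8 + 8,
                  "record must have a stable wire size");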

Note that eventually we will have a new API (or enhancement to existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values. But we don't have a schedule yet for when that will be available.


philipp-schmidt commented on May 22, 2024

Note that eventually we will have a new API (or enhancement to existing API) that will allow you to send a "handle" to the shared memory containing the input tensor instead of the actual tensor values. But we don't have a schedule yet for when that will be available.

@deadeyegoodwin Anything new regarding this topic? Using shared memory would probably double (if not triple) my throughput at this point, so I will have to implement one of the solutions mentioned above anyway. I will of course share my insights if needed, so it would be useful to know what the current state and plan is API-wise.

This is where I'm at right now:

a. Use GRPC (or HTTP) to communicate from pre-processor -> TRTIS. Since communication is now local it may no longer be a bottleneck...

This unfortunately does not increase performance a lot, as HTTP (and to a lesser extent gRPC) seems to become the major bottleneck with large input tensors (608x608x3 in this case) quite rapidly, even on localhost:

root@pc001:/workspace/build# ./perf_client -m yolov3_trt -t 16 -p 15000 -b 32              
*** Measurement Settings ***
  Batch size: 32
  Measurement window: 15000 msec
  Reporting average latency

Request concurrency: 16
  Client: 
    Request count: 54
    Throughput: 115 infer/sec
    Avg latency: 4532481 usec (standard deviation 2339987 usec)
    Avg HTTP time: 4524200 usec (send/recv 3499134 usec + response wait 1025066 usec)
  Server: 
    Request count: 64
    Avg request latency: 1036893 usec (overhead 15472 usec + queue 651073 usec + compute 370348 usec)

4.5 seconds of HTTP time versus roughly 1 second of server-side request latency (of which ~0.37 s is compute), with 1135 Mbit (!) per batch of 32. Not sure if allocation and initialization of the batch on the client side is included though.
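
(For reference: assuming FP32 inputs, one 608x608x3 image is 608 x 608 x 3 x 4 bytes ≈ 4.4 MB, so a batch of 32 is ≈ 142 MB ≈ 1135 Mbit, which is where the figure above comes from.)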

b. For even higher BW between pre-processor -> TRTIS, remove the GRPC/HTTP protocol overhead by implementing a custom/raw socket API. The internal APIs to allow this are already available within TRTIS and we plan to formalize and document them better in the future. Another option here is a flatbuffer interface which we have also thought about but not done anything with as yet.

I think this could be a very fast solution, even compared with shared-memory approaches. A quick test with iperf on localhost indicates that transmission on loopback is mainly CPU bound:

root@pc001:~$ iperf -c localhost
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 44234 connected with 127.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  96.3 GBytes  82.7 Gbits/sec

So I suppose TCP slow start must be disabled or worked around, but this might be sufficient, even in comparison to shared memory.

c. Use shared-memory as you suggest... this would likely require a custom TRTIS API to communicate the shared-memory reference so is similar to (b).

Probably the fastest and "cleanest" solution. Would love to see this supported in the API, without the need for a custom backend receiving shared memory handles and passing the data on in an ensemble. Right now this is the way to go though I guess? So I will try that first before checking b). Any additional input is much appreciated.


philipp-schmidt commented on May 22, 2024

Works perfectly, thanks!

I had to add "--ipc=host" to docker (it has been mentioned somewhere in a related pull request) and use the -I flag to prevent the simple_shm_client from using output shared memory, if anyone else is trying.


philipp-schmidt commented on May 22, 2024

@CoderHam for the documentation it might also be worth adding that Docker is apparently limited to 64MB of shared memory by default, which is easily surpassed by even modest batch sizes for some models.
--shm-size=256m, for example, increases this limit to 256MB.

https://stackoverflow.com/questions/30210362/how-to-increase-the-size-of-the-dev-shm-in-docker-container

And thanks for #541, was about to dive deeper into the code when your commits started coming in ;)


CoderHam commented on May 22, 2024

@philipp-schmidt Thanks for bringing the memory limit to my attention. I will go ahead and document the same in the client + server API docs.

