
Comments (4)

kavon commented on September 18, 2024

It seems the problem is that the halomon static library is quite large; we should turn it into a shared library and see what happens. Even when the Monitor system is skipped entirely, or the halo-prepare pass is disabled, we see the same overhead.

I think we're seeing page faults because there is so much code in the text section. Notice the almost proportional increase in minor page faults when linking in the halomon static lib versus not doing so. Weird memory effects like this would explain why O2 and O3 don't see this type of issue, even though they would presumably be more obviously affected by XRay nop-sleds, for example.

➤ ls -liah | grep halo
15743662 -rwxr-xr-x  1 kavon kavon  21K Sep 17 14:56 nohalo
15743663 -rwxr-xr-x  1 kavon kavon 9.6M Sep 17 14:41 withhalo
15738726 -rwxr-xr-x  1 kavon kavon 9.6M Sep 17 14:54 withhalo_nomon
15743701 -rwxr-xr-x  1 kavon kavon 9.6M Sep 17 14:57 withhalo_nomon_noprepare
15743700 -rwxr-xr-x  1 kavon kavon 9.6M Sep 17 14:40 withhalo_noprepare
kavon@zeus:~/p/h/test|master⚡*?
➤ ../build/bin/clang++ -DSMALL_PROBLEM_SIZE -O1 bench/cpp/oopack_v1p8.cpp -o nohalo
kavon@zeus:~/p/h/test|master⚡*?
➤ time ./nohalo 
                         Seconds       Mflops         
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max             15000
Matrix            200
Complex          2000
Iterator        20000

DONE!
36.24user 0.00system 0:36.24elapsed 99%CPU (0avgtext+0avgdata 14884maxresident)k
0inputs+0outputs (0major+3060minor)pagefaults 0swaps
kavon@zeus:~/p/h/test|master⚡*?
➤ ../build/bin/clang++ -DSMALL_PROBLEM_SIZE -O1 -fhalo bench/cpp/oopack_v1p8.cpp -o withhalo_nomon
kavon@zeus:~/p/h/test|master⚡*?
➤ time ./withhalo_nomon 
halo info: Empty Halomon Running!
                         Seconds       Mflops         
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max             15000
Matrix            200
Complex          2000
Iterator        20000

DONE!
49.13user 0.02system 0:49.18elapsed 99%CPU (0avgtext+0avgdata 40060maxresident)k
32inputs+0outputs (1major+4286minor)pagefaults 0swaps
kavon@zeus:~/p/h/test|master⚡*?
➤ ../build/bin/clang++ -DSMALL_PROBLEM_SIZE -O1 -fhalo bench/cpp/oopack_v1p8.cpp -o withhalo_nomon_noprepare
kavon@zeus:~/p/h/test|master⚡*?
➤ time ./withhalo_nomon_noprepare 
halo info: Empty Halomon Running!
                         Seconds       Mflops         
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max             15000
Matrix            200
Complex          2000
Iterator        20000

DONE!
47.81user 0.00system 0:47.84elapsed 99%CPU (0avgtext+0avgdata 40188maxresident)k
0inputs+0outputs (0major+4291minor)pagefaults 0swaps
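
To make the "almost proportional" observation concrete, here is a small sketch that computes the relative increases from the `time` output above (nohalo vs. withhalo_nomon). The dictionary keys and the helper name are made up for illustration; only the figures come from the logs. It shows a correlation between running time and minor page faults, nothing more.

```python
# Figures copied from the `time` output above; key names are illustrative.
nohalo = {"user_s": 36.24, "minor_faults": 3060, "max_rss_kb": 14884}
withhalo_nomon = {"user_s": 49.13, "minor_faults": 4286, "max_rss_kb": 40060}


def relative_increase(before, after):
    """Return the fractional change from `before` to `after`."""
    return (after - before) / before


for key in nohalo:
    pct = 100 * relative_increase(nohalo[key], withhalo_nomon[key])
    print(f"{key}: {pct:+.1f}%")
```

User time and minor page faults grow by a similar fraction, while the maximum resident set size grows far more, so the fault count tracks the slowdown more closely than the memory footprint does.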

from halo.

kavon commented on September 18, 2024

We're seeing this bad behavior due to the naive insertion of XRay sleds into all functions. Converting to a shared library didn't change the running time at all.

15739084 -rwxr-xr-x  1 kavon kavon  66K Sep 17 15:41 withhalo_nolib
15743662 -rwxr-xr-x  1 kavon kavon  62K Sep 17 15:44 withhalo_nolib_noxraysleds
15738726 -rwxr-xr-x  1 kavon kavon  74K Sep 17 15:38 withhalo_sharedlib
➤ time ./withhalo_sharedlib > /dev/null
49.20user 0.01system 0:49.26elapsed 99%CPU (0avgtext+0avgdata 25488maxresident)k
0inputs+0outputs (0major+3557minor)pagefaults 0swaps
kavon@zeus:~/p/h/test|master⚡*?
➤ time ./withhalo_nolib > /dev/null
49.25user 0.00system 0:49.31elapsed 99%CPU (0avgtext+0avgdata 14768maxresident)k
0inputs+0outputs (0major+3058minor)pagefaults 0swaps
kavon@zeus:~/p/h/test|master⚡*?
➤ time ./withhalo_nolib_noxraysleds > /dev/null
36.30user 0.00system 0:36.32elapsed 99%CPU (0avgtext+0avgdata 14936maxresident)k
0inputs+0outputs (0major+3060minor)pagefaults 0swaps


kavon commented on September 18, 2024

Currently for oopack, with no server running:

 Performance counter stats for './nohalo' (5 runs): -O1 -fno-halo

      38541.593084      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.17% )
                53      context-switches          #    0.001 K/sec                    ( +- 25.44% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             3,049      page-faults               #    0.079 K/sec                    ( +-  0.02% )
   149,639,664,048      cycles                    #    3.883 GHz                      ( +-  0.08% )
   228,833,644,966      instructions              #    1.53  insn per cycle           ( +-  0.00% )
    65,261,138,636      branches                  # 1693.265 M/sec                    ( +-  0.00% )
           826,234      branch-misses             #    0.00% of all branches          ( +-  0.29% )

      38.546724097 seconds time elapsed                                          ( +-  0.18% )

 Performance counter stats for './withhalo' (5 runs): -O1 -fhalo

      49922.855019      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.12% )
                60      context-switches          #    0.001 K/sec                    ( +- 28.58% )
                 2      cpu-migrations            #    0.000 K/sec                    ( +- 26.50% )
             4,311      page-faults               #    0.086 K/sec                    ( +-  0.02% )
   193,832,699,275      cycles                    #    3.883 GHz                      ( +-  0.06% )
   272,002,846,416      instructions              #    1.40  insn per cycle           ( +-  0.00% )
    92,335,372,229      branches                  # 1849.561 M/sec                    ( +-  0.00% )
           969,310      branch-misses             #    0.00% of all branches          ( +-  0.97% )

      49.948874022 seconds time elapsed                                          ( +-  0.13% )
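
A quick way to read the two perf reports side by side is to take the differences. The sketch below does that with the counter values copied from above; the variable and function names are illustrative only.

```python
# Counter values copied from the two perf reports above.
nohalo = {
    "cycles": 149_639_664_048,
    "instructions": 228_833_644_966,
    "branches": 65_261_138_636,
}
withhalo = {
    "cycles": 193_832_699_275,
    "instructions": 272_002_846_416,
    "branches": 92_335_372_229,
}


def delta(key):
    """Extra events counted in the -fhalo build for the given counter."""
    return withhalo[key] - nohalo[key]


def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


for key in nohalo:
    extra = delta(key)
    print(f"{key}: +{extra:,} ({100 * extra / nohalo[key]:.1f}% more)")

# Fraction of the extra instructions that are themselves branches.
branch_share = delta("branches") / delta("instructions")
print(f"branch share of extra instructions: {100 * branch_share:.1f}%")
print(f"IPC: {ipc(nohalo):.2f} -> {ipc(withhalo):.2f}")
```

Well over half of the additional instructions are branches, which is at least consistent with extra jumps being executed in the instrumented build; the counters alone do not identify the cause.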


kavon commented on September 18, 2024

The heuristic in 07e24e8 is quite simple: leaf functions that contain no loops (though they may contain cycles) and have fewer than 50 LLVM IR instructions are not made patchable. All other functions are, apart from the usual exceptions (non-reentrant functions, naked functions, etc.). This already solves the performance issue tracked here for oopack:

 Performance counter stats for './withhalo' (5 runs):

      37850.876872      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.37% )
                46      context-switches          #    0.001 K/sec                    ( +- 21.66% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 61.24% )
             4,307      page-faults               #    0.114 K/sec                    ( +-  0.01% )
   147,589,115,753      cycles                    #    3.899 GHz                      ( +-  0.09% )
   230,856,291,463      instructions              #    1.56  insn per cycle           ( +-  0.00% )
    66,265,028,684      branches                  # 1750.687 M/sec                    ( +-  0.00% )
           909,259      branch-misses             #    0.00% of all branches          ( +-  0.22% )

      37.873064170 seconds time elapsed     
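
The decision rule described above can be modelled roughly as follows. This is only a sketch of the rule as stated in this comment, not the code in 07e24e8: `FunctionSummary`, its fields, and `should_be_patchable` are hypothetical names, and the `excluded` flag stands in for the other exceptions (non-reentrant, naked, etc.) without modelling them.

```python
from dataclasses import dataclass

# Threshold stated in the comment above.
MAX_SMALL_LEAF_INSTRUCTIONS = 50


@dataclass
class FunctionSummary:
    """Hypothetical per-function facts; not a real LLVM or Halo type."""
    name: str
    ir_instruction_count: int
    makes_calls: bool       # False means the function is a leaf
    has_loop: bool
    excluded: bool = False  # placeholder for the other exceptions


def should_be_patchable(fn: FunctionSummary) -> bool:
    """Return True if the function would receive XRay sleds under this rule."""
    if fn.excluded:
        return False
    is_small_loop_free_leaf = (
        not fn.makes_calls
        and not fn.has_loop
        and fn.ir_instruction_count < MAX_SMALL_LEAF_INSTRUCTIONS
    )
    return not is_small_loop_free_leaf


# Illustrative inputs only; these functions are invented for the example.
getter = FunctionSummary("getter", ir_instruction_count=4, makes_calls=False, has_loop=False)
kernel = FunctionSummary("kernel", ir_instruction_count=30, makes_calls=False, has_loop=True)
print(should_be_patchable(getter), should_be_patchable(kernel))
```

The intent of such a rule is that tiny, frequently called leaf functions skip the sled cost, while anything with a loop or a call stays patchable.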

