msrccsbuild avatar msrccsbuild commented on August 18, 2024

Hi, Xiang,

I am back from my vacation. I still have some jet lag, so please pardon the delay in response.

Can you give me an update on the status of your Prajna deployment, in particular what works and what doesn't at the moment?

Best,
Jin

From: Xiang Zhang [mailto:[email protected]]
Sent: Thursday, February 18, 2016 7:12 PM
To: MSRCCS/Prajna [email protected]
Subject: Re: [Prajna] Application hangs in cleanup (#156)

Hmm, strange: if I run a local cluster, it works, and if I then switch back to the real cluster, it now works almost every time.

Reply to this email directly or view it on GitHub: https://github.com//issues/156#issuecomment-186028146.

from prajna.

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin,

Thanks for your help. I can make a GPU cluster computation work on localhost, and I can start two daemons on localhost that are mapped to 2 GPUs. (Your app.config merging works, but I found out that the settings have to go in PrajnaClientExt.exe.config, so it would be good to mention this in the documentation.)

I think there are two important issues that prevent me from continuing:

  1. this cleanup issue (#156)
  2. the remote daemon not working (#157)

For 1, it is not good that I have to kill the process each time because it cannot clean up.
For 2, I remember that it works with the NuGet binary, but I cannot make it work with the binary built from source, even after I set up a domain controller. I'm not very familiar with PowerShell scripting and Windows domain management, so I'm learning them when I have time. If you could have a look at this issue and give me some guidance on how to set up a remote daemon, that would be very helpful.

Regards,
Xiang.

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Hi, Xiang:

A quick update: I am working with my colleague on the cleanup issue 1). We will check in a fix soon. Meanwhile, please ensure that your program runs in the x64 configuration.

The report of the remote daemon issue 2) is a little vague. I am checking in a change (please wait until the AppVeyor build completes and I merge the change into the main branch). Can you then check out the package, try a run on the remote daemon, and let me know what error message you see?

Thanks,
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin, thanks, I will try x64. Regarding issue 2, what do you mean by "check out the package"? Could you be more specific?

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Can you check out the latest Prajna check-in (master branch) and see whether problems 1) and 2) are still there? Please start PrajnaClient.exe with

PrajnaClient.exe -verbose 4

It will offer more debugging information. Please email me the log file (or put it somewhere online) so that I can take a look.

Thanks,
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Thanks Jin, I will do this later today and email you about the result.

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin,

I ran tests against your newest master branch. Here are the results:

  1. The cleanup issue is fixed; it is very smooth now.
  2. The remote daemon problem still exists.

Here is how I tested the remote daemon:

I started 3 daemons on 2 machines (KINGKONG and MACXIANG), like this:

  • On KINGKONG: PrajnaClient.exe -verbose 4
  • On KINGKONG: PrajnaClient.exe -port 1005 -jobports 1250-1300 -verbose 4
  • On MACXIANG: PrajnaClient.exe -verbose 4

And here is my cluster.lst file:

XiangCluster,1082
kingkong,1082
kingkong,1005
macxiang,1082

Now I run an application that just gathers the machine name and GPU name from each daemon. The GPU id comes from the merged app configuration.

        private static void SayHello(Cluster cluster)
        {
            var dset = new DSet<int> { Name = Guid.NewGuid().ToString("D"), Cluster = cluster };
            var descriptions =
                dset
                .Distribute(Enumerable.Range(0, cluster.NumNodes))
                .Select(i =>
                {
                    var gpuId = Int32.Parse(ConfigurationManager.AppSettings["GpuId"]);
                    var machineName = System.Environment.MachineName;
                    var process = System.Diagnostics.Process.GetCurrentProcess();
                    var gpu = Gpu.Get(gpuId);
                    return $"Hello from {machineName} {gpu} taskId={i} processId={process.Id} threadId={Thread.CurrentThread.ManagedThreadId}";
                })
                .ToIEnumerable()
                .ToArray();
            foreach (var description in descriptions)
            {
                Console.WriteLine(description);
            }
        }

        private static void Main()
        {
            Console.WriteLine("Prajna init...");
            Prajna.Core.Environment.Init();
            Console.WriteLine("Prajna init done.");

            // workaround to let Prajna automatically upload resource assembly.
            Alea.CUDA.CT.LibDevice.Locator.Ping();
            Alea.CUDA.CT.Native.X86.B64.Windows.Locator.Ping();

            var cluster = new Cluster("Cluster.lst");
            Console.WriteLine($"Cluster.NumNodes = {cluster.NumNodes}");

            SayHello(cluster);

            Console.WriteLine("Prajna cleanup...");
            Prajna.Core.Environment.Cleanup();
            Console.WriteLine("Prajna cleanup done");
        }

I ran it twice. On the first run, it hangs for a while, then returns nothing and cleans up.

On the second run, it hangs for a while and then throws an exception, something like "wrong string format". Here is a screenshot of that exception:

exception_during_second_run

I zipped the logs from the two machines (note that there are 2 daemons on KINGKONG, so it has twice as many logs):

Log_KINGKONG.zip
Log_MACXIANG.zip

soloman817 avatar soloman817 commented on August 18, 2024

I just checked the exception. It is thrown from my SayHello code, and I located it here:

                    var gpuId = Int32.Parse(ConfigurationManager.AppSettings["GpuId"]);

So I try to read an app setting from PrajnaClientExt.exe.config. That seems to work on localhost, but it looks like it has a problem remotely. I hope this helps.

soloman817 avatar soloman817 commented on August 18, 2024

OK, I did a new test. This time I don't use the merged configuration:

        private static void SayHello(Cluster cluster)
        {
            var dset = new DSet<int> { Name = Guid.NewGuid().ToString("D"), Cluster = cluster };
            var descriptions =
                dset
                .Distribute(Enumerable.Range(0, cluster.NumNodes))
                .Select(i =>
                {
                    //var gpuId = Int32.Parse(ConfigurationManager.AppSettings["GpuId"]);
                    var machineName = System.Environment.MachineName;
                    var process = System.Diagnostics.Process.GetCurrentProcess();
                    //var gpu = Gpu.Get(gpuId);
                    var gpu = "Skipped";
                    return $"Hello from {machineName} {gpu} taskId={i} processId={process.Id} threadId={Thread.CurrentThread.ManagedThreadId}";
                })
                .ToIEnumerable()
                .ToArray();
            foreach (var description in descriptions)
            {
                Console.WriteLine(description);
            }
        }

I just replaced the GPU name with "Skipped", so the code does not read the merged configuration file.

Then I ran it twice again:

The first time, it hangs for a while, and then no strings are gathered. (I remember this was always the case even with the NuGet binary: the first run always returns nothing, for example the first run after deleting the C:\Prajna folder.)

The second run now works :)

2016-02-26_2019

So I think there are two issues regarding the remote daemon:

  1. The first run. You can try to reproduce it by deleting the C:\Prajna folder completely and then starting the daemons again, so that everything starts fresh. In this case the application returns nothing; in other words, DSet.Count = 0.
  2. After the first run it works, but it looks like the configuration file merging doesn't work remotely.

Anyway, I uploaded the logs again, this time without reading the merged configuration:

Log_KINGKONG_2.zip
Log_MACXIANG_2.zip

Hope that helps.

soloman817 avatar soloman817 commented on August 18, 2024

Wow, I found something. I changed my code to print the configuration string it actually gets:

        private static void SayHello(Cluster cluster)
        {
            var dset = new DSet<int> { Name = Guid.NewGuid().ToString("D"), Cluster = cluster };
            var descriptions =
                dset
                .Distribute(Enumerable.Range(0, cluster.NumNodes))
                .Select(i =>
                {
                    //var gpuId = Int32.Parse(ConfigurationManager.AppSettings["GpuId"]);
                    var machineName = System.Environment.MachineName;
                    var process = System.Diagnostics.Process.GetCurrentProcess();
                    //var gpu = Gpu.Get(gpuId);
                    var gpu = ConfigurationManager.AppSettings["GpuId"];
                    return $"Hello from {machineName} \"{gpu}\" taskId={i} processId={process.Id} threadId={Thread.CurrentThread.ManagedThreadId}";
                })
                .ToIEnumerable()
                .ToArray();
            foreach (var description in descriptions)
            {
                Console.WriteLine(description);
            }
        }

Now, if I run it (of course the first run fails, but the second run gives me this output):

wrong_configuration

The problem is that on the second run, the configuration on the local computer (KINGKONG) is doubled to "0,0" and "1,1"!

Here are my original configuration files, which are used to start the daemons:

origin_config

Then I checked the generated config in C:\Prajna\Job1005\PrajnaTest.CS\069A5FC4043A9837:

wrong_configuration

I think this at least located some problem :)
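
The doubled value would explain the earlier "wrong string format" exception, since Int32.Parse throws a FormatException on text such as "0,0". As a minimal sketch (not part of Prajna or Alea; the GpuIdSetting class and its TryParse helper are hypothetical names introduced here), the lookup inside the Select lambda could validate the raw setting and report the offending text rather than throw:

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Prajna or Alea.
public static class GpuIdSetting
{
    // Parses the raw "GpuId" app setting. Accepts only plain decimal digits,
    // so a doubled value such as "0,0" is rejected with a readable message
    // instead of a FormatException.
    public static bool TryParse(string raw, out int gpuId, out string error)
    {
        if (int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out gpuId))
        {
            error = null;
            return true;
        }

        string shown = raw ?? "<missing>";
        error = "GpuId setting is not a plain integer: \"" + shown + "\"";
        return false;
    }

    public static void Main()
    {
        // In the real lambda, raw would come from
        // ConfigurationManager.AppSettings["GpuId"].
        foreach (string raw in new[] { "0", "0,0", null })
        {
            int gpuId;
            string error;
            Console.WriteLine(TryParse(raw, out gpuId, out error)
                ? "GpuId = " + gpuId
                : error);
        }
    }
}
```

Returning the error text as part of the per-node description would make a broken configuration visible in the gathered output from each daemon, without attaching a debugger to the remote process.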

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Hi, Xiang:

Thanks for identifying the issue. I can repro the configuration issue. It seems that I need to detect duplicate configuration settings and copy only the new ones. I have checked in a number of changes (which also fix a load-balancing issue). Please check whether the configuration issue has been fixed.
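
The duplicate check described here can be pictured with the sketch below. It is not Prajna's merge code, and it assumes (the thread does not confirm this) that the doubled "0,0" value came from adding a key that was already present. NameValueCollection, the type returned by ConfigurationManager.AppSettings, joins repeated values for one key with a comma. AppSettingsMergeSketch and MergeNewKeysOnly are hypothetical names.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical illustration, not Prajna's actual merge logic.
public static class AppSettingsMergeSketch
{
    // Copies only the keys that the target does not contain yet, so merging
    // the same source a second time leaves the target unchanged.
    public static void MergeNewKeysOnly(NameValueCollection target, NameValueCollection source)
    {
        foreach (string key in source.AllKeys)
        {
            if (target[key] == null)
            {
                target[key] = source[key];
            }
        }
    }

    public static void Main()
    {
        var daemonSettings = new NameValueCollection { { "GpuId", "0" } };

        // Naive merge: Add appends a second value under the existing key,
        // and the indexer then returns both values joined by a comma.
        var naive = new NameValueCollection(daemonSettings);
        naive.Add("GpuId", "0");
        Console.WriteLine("naive merge:   " + naive["GpuId"]);

        // Guarded merge: the existing key is left alone.
        var guarded = new NameValueCollection(daemonSettings);
        MergeNewKeysOnly(guarded, daemonSettings);
        Console.WriteLine("guarded merge: " + guarded["GpuId"]);
    }
}
```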

I cannot repro the remote daemon startup issue. For me, the remote daemon starts fine the first time.

I did the setup as follows:

  1. I used two nodes, MachineA and MachineB.

  2. I deleted the folder c:\Prajna. (If you fail to delete c:\Prajna, please check Task Manager for lingering instances of PrajnaClient.exe or PrajnaClientExt.exe.)

  3. I started 2 PrajnaClient.exe instances on MachineA (one default, one on port 1005) and 1 PrajnaClient.exe instance on MachineB.

  4. I ran the PrajnaTest.CS project in ExtraTest.

I don't have any issue with the first run (or the second run).

One thing to note is that PrajnaClient.exe needs to create the directory c:\Prajna. Please make sure that the credential PrajnaClient.exe runs under has sufficient privileges to create the directory. If not, please create c:\Prajna with the proper security settings in advance.

I suggest that you do the following:

  1. Check out the latest Prajna build (master branch).

  2. Start PrajnaClient exactly as you have described, but do not modify the configuration settings.

  3. Check whether PrajnaTest.CS (with a proper cluster.lst) works for you.

  4. Then check whether the configuration issue still exists.

Let's see if the configuration issue is still there.

Have a nice weekend!
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Thanks Jin. I am reading your PS script to start the client correctly on the remote machine. Currently it cannot create a PSSession because CredSSP is not allowed. I am still looking into this, as I'm not experienced with Windows domains. I will keep you updated.

msrccsbuild avatar msrccsbuild commented on August 18, 2024

I have checked in a setup script for Windows Domain Cluster. You may want to try it out to see if it helps to create the PSSession.

Thanks,
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin,

I followed your setup scripts to set up the management pieces, and I also turned off the firewall completely. Now I can successfully use your scripts to start a client on the local machine. (On the remote machine it seems to work, but an error is printed, saying something like "cannot find the machine".)

If I only start one daemon, it works like this:
2016-02-28_2140

But now I always get an exception when I run any application, something like "p2p transfer not implemented":
2016-02-29_1507

This exception has happened every time since yesterday, after I synchronized the source code and recompiled Prajna.

Any idea about this exception?

soloman817 avatar soloman817 commented on August 18, 2024

More info: if I use "localhost" in the cluster.lst file, it works, and if I use an IP address in cluster.lst, it also works. Once I use a computer name in cluster.lst, it throws that exception.

soloman817 avatar soloman817 commented on August 18, 2024

Update:

Now it works better. If I use IP addresses instead of hostnames in the cluster.lst file, it works. However, my notebook uses Wi-Fi, which is not very fast (usually 17 Mbps), so it took around 30 seconds to copy the application to its job folder. In this case it sometimes just returns nothing, which might be why some of my fresh starts return nothing. If you have a high-speed LAN, you can try embedding a huge file in your assembly and then check whether it still works when the application assembly takes a long time to upload to the remote machine.

Second, I still cannot use hostnames in the cluster.lst file, especially remote hostnames. Once I replace the remote machine's IP address with its hostname, I get that "P2P not implemented" exception.

I noticed one error when I deploy clients using your script. It looks like this:
2016-02-28_2140

Strangely, it still works: it can copy the clients to the remote machine and start them.

So I was wondering whether something is wrong in the remoting configuration. I'm now checking this page: http://www.thomasmaurer.ch/2011/01/quick-powershell-remoting-guide/

BTW, I have two machines and one domain controller, and both computers are joined to the domain. So my computers' full names are kingkong.xiangnet.local and macxiang.xiangnet.local, but I cannot use the full names in the deploy script or in cluster.lst. In cluster.lst I have to use just the hostname without the domain suffix. (BTW, the primary suffix in the IPv4 configuration is an empty string.)

soloman817 avatar soloman817 commented on August 18, 2024

I tested it, and the remoting seems to work fine. But if I use your script, the Invoke-Command -ComputerName call in stop-client throws that error, while the other parts, such as copy and start, work fine. Strange.

soloman817 avatar soloman817 commented on August 18, 2024

Found one bug in your script:
https://github.com/MSRCCS/Prajna/blob/master/scripts/WindowsDomainCluster/Deploy-Clients.ps1#L45

you should also pass -Cred $Cred here.

soloman817 avatar soloman817 commented on August 18, 2024

OK, here are my conclusions from today's experiments:

  1. The script Deploy-Clients.ps1 has a bug: it doesn't forward the -Cred option to Stop-Clients.ps1.
  2. For a fresh run, first run Stop-Clients.ps1, then delete the Prajna folder and the PrajnaClient folder on each of your machines, then run Deploy-Clients.ps1. Now you have a fresh start. Then you can embed a huge file in your application assembly to simulate a long network upload. On my machines, I observed that if the transfer takes longer than 40 to 50 seconds, there are suddenly 2 Prajna.Client processes on the remote machine, and then both daemons die and the application returns nothing. I have to restart the daemons, and then it works, because after that not many assemblies need to be uploaded: only the application assembly, since the referenced assemblies don't need to be re-uploaded. Question: is there a timeout control for uploading? Maybe we can increase it.
  3. In the cluster.lst file on my machine, if I use hostnames, I get an exception saying something like "P2P not implemented", but if I use IP addresses, it works. I don't know why, but I can live with this when creating a demo cluster.

Regards,
Xiang.

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Thanks for the summary. Just to let you know, we are still working on the issue.

Jin

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Thanks for pinpointing the bugs in Prajna. The latest check-in addresses the following issues:

  1. The -Cred option is added to Deploy-Clients.ps1.

  2. You may use:

    Prajna.Core.Environment.SetRemoteContainerEstablishmentTimeout(timeout);

     to adjust the RemoteContainer timeout value. We will try to make the error message more informative in the future so that it indicates the timeout. Also, we can't reproduce the "2 processes of Prajna.Client" event. Please let us know if this issue still appears, and we will investigate further.

  3. We are unable to reproduce the "P2P not implemented" incident. I suggest that you do the following:

     a. Run the client (e.g., CleanUpTest.exe) with -verbose 3 -con

     b. Turn on DNS logging with:

    Prajna.Core.Environment.EnableLoggingOnDNS()

     Then take a look at the log line that says:

    Calling Dns.GetHostAddresses, resolve hostname

     It shows how the hostname is resolved in Prajna, and may give you a clue as to why the hostname doesn't work while the IP address does.
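
Independently of Prajna's own logging, the same .NET call named in that log line can be run directly on the client machine to see what each name in cluster.lst resolves to. The sketch below uses only System.Net.Dns.GetHostAddresses; the HostResolutionCheck class is a hypothetical name, and the default host names are simply the ones mentioned in this thread.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical diagnostic helper, not part of Prajna.
public static class HostResolutionCheck
{
    // Returns one line listing every address the name resolves to,
    // each prefixed with its address family (e.g. InterNetwork for IPv4).
    public static string Describe(string host)
    {
        try
        {
            IPAddress[] addresses = Dns.GetHostAddresses(host);
            string[] parts = Array.ConvertAll(addresses, a => a.AddressFamily + ":" + a);
            return host + " -> " + string.Join(", ", parts);
        }
        catch (SocketException ex)
        {
            return host + " -> resolution failed: " + ex.SocketErrorCode;
        }
    }

    public static void Main(string[] args)
    {
        // Pass the names from cluster.lst on the command line,
        // or fall back to the names used in this thread.
        string[] hosts = args.Length > 0
            ? args
            : new[] { "localhost", "kingkong", "macxiang" };

        foreach (string host in hosts)
        {
            Console.WriteLine(Describe(host));
        }
    }
}
```

Comparing the output for a short hostname, its fully qualified name, and the IP address that is known to work shows whether the names resolve at all and to which address families.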

Best,
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin, thanks for your reply. I'm currently setting up Prajna on AWS to see whether any issues show up there; it is still in progress. BTW, the firewall rules you listed seem not to be enough: I have to turn off the firewall (or turn it off only for the domain network profile) for it to continue.

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Thanks. Please elaborate a little on your setup environment. I can include the setup information to help future users who are also interested in using Prajna on AWS. (We have never tried it there.)

Best,
Jin

soloman817 avatar soloman817 commented on August 18, 2024

Hi Jin, I successfully set up a Prajna cluster on AWS. Here are some findings:

  1. In a freshly created Windows domain network, it works with hostnames.
  2. The remote daemon problem: I finally caught it :) It is because of the cluster name! In the cluster list file we write a cluster name, and once this cluster is used, it is stored in each daemon's working folder. This means you should not change the configuration of that cluster anymore. If you change it, say by adding a new node, it no longer matches the cluster configuration in the daemon's working repository, and strange behavior follows. Previously, when I tested, I first put one node in the list and ran it; if that was OK, I added a new node to the list without changing the cluster name, and then it didn't work anymore. So I suggest that you document this, because it can easily go wrong. Also, from a product perspective, Prajna should provide an API to remove or update the cluster object that lives in the daemon's cache.
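
Until such an API exists, one way to avoid reusing a cached cluster name is to derive the name from the node list, so that any change to the list produces a new name. The sketch below is only an illustration under stated assumptions: ClusterNameSketch and VersionedName are hypothetical names, the node entries are the "host,port" lines shown earlier in this thread, and whether Prajna accepts a hyphen and hex suffix in a cluster name has not been verified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of Prajna.
public static class ClusterNameSketch
{
    // Builds "<baseName>-<8 hex chars>", where the suffix is taken from a
    // SHA-256 hash of the node entries in their listed order. Adding,
    // removing, or reordering entries therefore yields a different name.
    public static string VersionedName(string baseName, IEnumerable<string> nodeEntries)
    {
        string joined = string.Join("\n", nodeEntries.Select(e => e.Trim()));
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(joined));
            string suffix = BitConverter.ToString(hash, 0, 4).Replace("-", "");
            return baseName + "-" + suffix;
        }
    }

    public static void Main()
    {
        var twoNodes = new[] { "kingkong,1082", "kingkong,1005" };
        var threeNodes = new[] { "kingkong,1082", "kingkong,1005", "macxiang,1082" };

        // The two lists produce different names, so a daemon that cached
        // the first cluster is not handed a changed list under the old name.
        Console.WriteLine(VersionedName("XiangCluster", twoNodes));
        Console.WriteLine(VersionedName("XiangCluster", threeNodes));
    }
}
```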

I did a pi estimation using three EC2 machines: one for control and two for GPU calculation. One GPU machine is a 2xlarge, which has 1 GPU; the other is an 8xlarge, which has 4 GPUs. The test results show clearly that it speeds up as you add more nodes:

pi_estimation_1

We will prepare a blog post on how to set up Prajna + Alea on AWS (with EC2 + VPC + Directory Service, it is quite a smooth experience :)

msrccsbuild avatar msrccsbuild commented on August 18, 2024

Hi, Xiang:

Thanks very much for the information. I have updated the deployment page of Prajna

https://github.com/MSRCCS/Prajna/wiki/Deploy-Prajna

to clarify that the clusterName should be changed whenever the configuration of the cluster changes.

I am working on the RemoteRunner functionality that you requested earlier. I will ping you once the work is done.

Best,
Jin
