
Comments (31)

cbeuw commented on June 20, 2024

I think I managed to get a reliable reproduction of this issue locally by creating artificial latency with https://github.com/Shopify/toxiproxy and making 10 parallel connections

notsure2 commented on June 20, 2024

It could be an issue in the Android wifi driver when operating in hotspot mode. Cloak's transmit/receive code is very simple (just a goroutine loop copying both sides). If there are latency issues or lags, it could be an issue in, or tuning needed for, the Go scheduler itself. It can also be a TCP stall/breakdown: in a proxy situation a phenomenon similar to bufferbloat appears, where the two TCP connections on either side of Cloak operate under different network conditions and don't "couple" their flow rates correctly.

Sometimes when there's a sudden network lag or throughput drop on one side, the other side doesn't feel it: the congestion on one side of Cloak is not properly transmitted as back pressure to the other side to slow its flow rate accordingly. This is caused by overly large buffers in Cloak (at the TCP level, or at the Cloak level) while copying between the connections.
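
To make the copy-loop picture concrete, here is a minimal sketch (not Cloak's actual code) of the kind of bidirectional relay being described; the buffer size handed to io.CopyBuffer bounds how much data can sit between the two connections, which is what lets congestion on one side reach the other as back pressure:

package relay

import (
	"io"
	"net"
)

// Relay copies data in both directions between a and b. With a small buffer,
// a blocked Write on one connection quickly stops the Reads from the other,
// coupling the two TCP flows; with a huge buffer, one side can keep reading
// long after the other side has stalled.
func Relay(a, b net.Conn, bufSize int) {
	done := make(chan struct{})
	go func() {
		io.CopyBuffer(b, a, make([]byte, bufSize)) // blocks when b stops accepting data
		b.Close()
		close(done)
	}()
	io.CopyBuffer(a, b, make([]byte, bufSize))
	a.Close()
	<-done
}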

I tried to tackle this in my fork by adding an extra configuration option or two to set the buffer size, but with limited or no effect. I suspect I haven't done something correctly, which is why I haven't made a PR to this repo with these changes yet.

Try my fork and see if playing with the extra config options related to the receive buffer and send buffer changes anything about the behavior you're seeing.

notsure2 commented on June 20, 2024

Another symptom of this breakdown is that the connection doesn't achieve its maximum download rate: the TCP congestion algorithm depends on sensing packet loss and back pressure to optimize the rate, so when the coupling is desynced it can't reach the maximum possible throughput either.

LindaFerum commented on June 20, 2024

Well @notsure2, I tried the small-tcp-buffer branch and playing with buffer sizes has not improved anything, alas.

However, I did find a workaround (a completely insane one but it works!)

I installed QubesOS and separated Cloak into its own proxy VM, sitting between the VM running OpenVPN and the VM containing the network card passthrough.

So now it looks like:

Laptop with Cloak (QubesOS):
  [Laptop's OpenVPN VM]
       |
  [Laptop's Cloak VM]
       |
  [Laptop's Network card VM]
       |
       | wifi (Android hotspot)
       |
Android Phone
       |
internet

And this boondoggle works!

Cloak works very well, both TCP and UDP, almost without flaw (well, the hiccups I described in another issue persist, but IMO they got better).

Also ping to my test remote improved by about 10 +/- 2 ms

I think this suggests that Cloak is somehow interacting poorly with my laptop's network card driver. Weird.

P.S.:
I do think that fine-tuning buffers is an excellent idea, though.
You should PR it after a check-up.
In the meantime, what would be a good process for determining good buffer sizes? I think I might play around with it more during my downtime.

notsure2 commented on June 20, 2024

Start it at 4 MB and halve it each time, and see how this changes the behavior. Ideally you want it really, really small, so that if the connection on one side seizes up or stalls, Cloak also stalls the other side by stopping the copy (so that the other side couples its congestion algorithm to the clogged side's congestion algorithm).
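
For reference, the knobs being discussed map onto Go's socket buffer setters on *net.TCPConn. A sketch (the function name and how it would hook into Cloak's config are illustrative):

package tuning

import "net"

// applySocketBuffers sets the kernel send/receive buffer sizes (in bytes) on a
// TCP connection. Per the advice above, start around 4 MB and halve on each run.
func applySocketBuffers(c *net.TCPConn, sendBytes, recvBytes int) error {
	if err := c.SetWriteBuffer(sendBytes); err != nil {
		return err
	}
	return c.SetReadBuffer(recvBytes)
}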

Edit: also use the notsure2 branch, as I may have made some bugfixes since merging it. The binaries I have on my fork are built from the notsure2 branch, so you don't need to build it yourself.

LindaFerum commented on June 20, 2024

@notsure2 Cool! I'll play with your notsure2 branch from now on. On an unrelated note, what's up with the support-cloudflare branch, does it work (and how do I use it)?

P.S.:
I spoke too soon about "everything works" after the Qubes "hack".

UDP still suffers a "hard stall" problem, but TCP appears to be immune (or at least the stall takes a long while to develop).

I used that to try to capture something in the server-side log, with what might be success.

The procedure I followed (restarted everything, server included):

Started Cloak clients in their own VM (networking configured appropriately)
Used different UIDs for the UDP and TCP clients to make sure they can be differentiated in the server's log

Started an OpenVPN UDP session, and a ping to get an idea of when the connection stalls (also visible in the Cloak client's trace output: the "reads" just stop)

As soon as a UDP "irrecoverable stall" was detected, I restarted OpenVPN with the TCP config and connected to the Cloak server to grab some logs.

The logs follow:

Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :443"

Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :80"

Aug 07 22:03:23 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:23Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=1248497831

Aug 07 22:06:39 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:39Z" level=info msg="New session" UID="!!!!!!TCP-CLIENT-UID==" sessionID=3365958504

Aug 07 22:06:58 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:58Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:36268" sessionId=0

Aug 07 22:06:59 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:59Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:37902" sessionId=0

Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Session closed" UID="UDP-CLIENT-UID==" reason=timeout sessionID=1248497831

Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Terminating active user" UID="UDP-CLIENT-UID==" reason="no session left"

Aug 07 22:07:41 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:41Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=202665511

Aug 07 22:07:54 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:54Z" level=warning msg="error reading first packet: read error after connection is established: EOF" remoteAddr="CENSORED:50530"

P.P.S.:
AAAAAAND AS SOON AS I TYPED IN "TCP APPEARS IMMUNE" THE TCP CONNECTION ALSO HAD A STALL

LindaFerum commented on June 20, 2024

Okay, it seems like limiting the buffers to exactly the same size (currently all set to the same 5 MB) for both send and receive on all ends AND raising the nofile limit for client and server to something big (30k or so) massively improves things.

I will let the test rig run overnight and see if Cloak UDP seizes up or nah

LindaFerum commented on June 20, 2024

Okay, @notsure2, after playing around with buffers a fair amount I have the following to report:
Having the buffers configured to around 5 MB (same on server and client) massively improved connection throughput on my current link (I can push 20 MB/s both up and down on this crappy, finicky connection) but did not resolve the stalls.

NOFILE is sadly likely a placebo (makes sense; lsof shows very few file descriptors open by the Cloak client).

Stalls become harder to reproduce but very much remain.
UDP Cloak stalls once every several hours or so, but when it stalls it stalls for good (even a full client restart doesn't always clear it, for some mystery reason).

TCP Cloak stalls less frequently still (I'd say I could get a full workday in without a stall in a realistic setting),

but unlike with UDP, a peculiar phenomenon emerged with the TCP Cloak stall and 5 MB buffers:

It became possible to semi-reliably trigger a TCP Cloak stall by pushing the connection as hard as possible (two 720HD 60 fps videos + a speedof.me test on loop + a ping running in the background do it for mine); after mere minutes of such abuse TCP Cloak would stall, and the connection would not recover until a client restart.

Cloak client logs remain somewhat unenlightening IMHO:

INFO[2023-08-08T14:39:07****] Attempting to start a new session
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
DEBU[2023-08-08T14:39:07****] All underlying connections established
INFO[2023-08-08T14:39:07****] Session 227931397 established
TRAC[2023-08-08T14:39:07****] stream 1 of session 227931397 opened
TRAC[2023-08-08T14:39:09****] stream 1 actively closed.
DEBU[2023-08-08T14:39:09****] session 227931397 has no active stream left
TRAC[2023-08-08T14:39:09****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:39:09****] 0 read from stream 1 with err EOF
TRAC[2023-08-08T14:39:09****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:39:23****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:39:23****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:39:23****] stream 2 of session 227931397 opened
TRAC[2023-08-08T14:48:35****] stream 2 actively closed. <--- trouble starts approximately here
DEBU[2023-08-08T14:48:35****] session 227931397 has no active stream left
TRAC[2023-08-08T14:48:35****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:48:35****] 0 read from stream 2 with err EOF
TRAC[2023-08-08T14:48:35****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:48:36****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:48:36****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:48:36****] stream 3 of session 227931397 opened
TRAC[2023-08-08T14:49:50****] stream 3 actively closed.
DEBU[2023-08-08T14:49:50****] session 227931397 has no active stream left
TRAC[2023-08-08T14:49:50****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:49:50****] 0 read from stream 3 with err EOF
TRAC[2023-08-08T14:49:50****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:49:52****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:49:52****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:49:52****] stream 4 of session 227931397 opened

My non-programmer bellyfeel here is that Cloak (likely the Cloak client, because the server seems to be doing fine) is silently running out of some resource and doesn't raise an error, but instead simply stops sending/receiving.

notsure2 commented on June 20, 2024

@LindaFerum I can't remember the exact details, as I found it ultimately useless since Cloudflare throttles free-tier WebSocket throughput.

  1. Add your domain to Cloudflare.
  2. Change your domain's nameservers to Cloudflare's.
  3. Change the A record of your domain to proxied.
  4. On your real server, set up nginx on port 80.
  5. Use acme.sh to get a real SSL certificate for your domain.
  6. Change the nginx SSL port to something else, put Cloak on port 443, and set the redir addr to the localhost nginx port.
  7. Use CDNWsUrlPath on the client to set the secret URL path that will go to Cloak, let's say /cloak.
  8. Add to the nginx config: location /cloak { proxy_pass http://localhost:443/; }
  9. Make sure WebSocket support is enabled in the Cloudflare settings.
  10. On the client, set CDN mode, set ServerName to your real domain, and set CdnWsUrlPath=/cloak.
  11. It should work.

LindaFerum commented on June 20, 2024

@notsure2 thanks! I'll try it.
There's no way to make "ServerName" something relatively inconspicuous with Cloudflare, right?
P.S.:
Any ideas on the stalls? (See my report above on playing around with buffers and provoking a stall.)

notsure2 commented on June 20, 2024

No. Cloudflare shut off domain fronting, so a hostile ISP can zero in on your domain, sniff the SNI, and TCP-reset it.

I have experienced these stalls also, but I don't know exactly why they happen. My suspicion points to some kind of goroutine or scheduler issue. Cloak's logging is not good and it's hard to interpret flows from it; it needs some improvement to become more legible.

From your log, there might be some kind of dynamic blocking done by the ISP: the session that opens and immediately closes looks like it got a TCP reset from the ISP. Try to collect a Wireshark capture while a stall happens and analyze the TCP flows to see if there's any abnormal behavior.

Also check, by matching timestamps, whether any error is reported on the server side. When an ISP sends a TCP reset, it sends it to both sides to cut the connection from both ends. The server will report this in its log. I think a TCP reset shows up as EOF.

notsure2 commented on June 20, 2024

I want some time to tackle this issue in depth, but sadly I don't have it these days.

LindaFerum commented on June 20, 2024

I did consider this being an "ISP asshole thing", since I'm currently in a relatively hostile environment.

I'm not good with Wireshark, but I may try using a "guaranteed no-asshole" satellite connection to see if Cloak-TCP stalls there.

notsure2 commented on June 20, 2024

Try NumConn=0. I implemented this specifically to avoid stalls by removing the multiplexing and having each connection map to one short-lived session. This works reasonably well for me.

LindaFerum commented on June 20, 2024

Further reporting:
Inducing a stall over satellite proved much harder, but after some connection abuse Cloak-TCP over satellite gave up and stalled.
Of note, it is a high-latency, low-bandwidth link that doesn't reach 1 Mb/s under any circumstances, so maybe that is what makes inducing a stall harder over satellite (race-condition-like behavior?).

Will try NumConn=0 later.

LindaFerum commented on June 20, 2024

Well, @notsure2, for what it's worth, NumConn=0 does seem to help with stalls a lot, even on the "bad/sus" connection (which I think substantially rules out the previous stalls being ISP meddling). But it also affects performance a lot (compared to the blast I can get out of this connection with multiplexing and 5 MB buffers), so if something race-like is happening here, the improvement may simply come from a lower probability of triggering the stall event.

Overall I think I'll stick with multiplexing, 5 MB buffers, and TCP mode for now.
Under normal conditions stalls appear quite rare, and I think I can jerry-rig something that restarts the Cloak client when the connection times out, as a stopgap mitigation (see the sketch below).
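
For the record, a rough sketch of that stopgap watchdog idea: run ck-client as a child process and restart it whenever a probe of its local listening port stops answering. The binary path, config path, port, and timeouts below are placeholders, not values from this thread:

package main

import (
	"log"
	"net"
	"os/exec"
	"time"
)

func main() {
	for {
		// Start the Cloak client; the flags and paths are illustrative.
		cmd := exec.Command("/usr/local/bin/ck-client", "-c", "/etc/cloak/ckclient.json")
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		for {
			time.Sleep(30 * time.Second)
			// Probe the local port the client listens on. A stronger check
			// would make a request all the way through the tunnel.
			conn, err := net.DialTimeout("tcp", "127.0.0.1:1984", 5*time.Second)
			if err != nil {
				log.Println("probe failed, restarting ck-client:", err)
				cmd.Process.Kill()
				cmd.Wait()
				break
			}
			conn.Close()
		}
	}
}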

LindaFerum commented on June 20, 2024

@cbeuw Awesome! Glad to know I caught an actual real bug and not just some mystery weirdness :)

cbeuw commented on June 20, 2024

For those who are interested, this is the bug:

Cloak has a concept of Streams. Each stream represents a TCP connection between the proxied application client and server (there is only one stream per session in UDP mode). As there are usually far fewer Cloak client to Cloak server TLS connections than there are Streams, data from each stream are interleaved in TLS connections.

Upon receiving some data, the Cloak server must figure out which stream the data needs to go to, using metadata attached by the Cloak client, and then send it to the corresponding TCP connection to the application server. This is done in session.recvDataFromRemote, which is called on the thread that manages a Cloak client-to-server TLS connection, so it must not block. If it blocked, then all data sent over that thread's TLS connection would queue up behind it, possibly including data from other streams - so if one server-side application has issues, the problem spreads to the rest of the session.

To prevent blocking, each stream has a buffer (recvBuffer) that's effectively a channel - data can be concurrently written to and read from it. If the application isn't reading fast enough, its data will be buffered, allowing the TLS thread to move on quickly.

However, a while ago I made an optimisation by adding a WriteTo method to recvBuffer. WriteTo reduces the amount of memory copying by writing data directly into a Writer interface rather than into the buffer. This made recvBuffer's non-blocking property dependent on the Writer being non-blocking - in this case a TCP connection that goes to an application server. This is wrong, because TCP sockets in Cloak are not guaranteed to be non-blocking.
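
A simplified sketch of the difference (illustrative only, not Cloak's actual recvBuffer): with a buffer in between, the goroutine reading the TLS connection hands data off and returns immediately, and only a per-stream goroutine blocks on a slow application socket; the WriteTo optimisation effectively moved that blocking write back onto the TLS goroutine.

package sketch

import "net"

// streamBuffer is a toy stand-in for Cloak's recvBuffer.
type streamBuffer struct {
	ch chan []byte
}

// put is called on the TLS connection's goroutine. While the channel has
// capacity it does not block, so one slow stream cannot hold up data
// destined for other streams multiplexed onto the same TLS connection.
func (b *streamBuffer) put(data []byte) {
	b.ch <- data
}

// drainTo runs on its own per-stream goroutine; only this goroutine blocks
// when the application-side socket's kernel buffer is full.
func (b *streamBuffer) drainTo(conn net.Conn) {
	for data := range b.ch {
		if _, err := conn.Write(data); err != nil {
			return
		}
	}
}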

This issue is hard to reproduce because most of the time, writing to a TCP connection to an application appears non-blocking. Most apps read data as they come, and there's also a buffer in the OS in case the app isn't consuming data fast enough. Writing to a TCP socket only blocks when this OS buffer is full, which doesn't happen often.

But a stall will happen when one application has two incoming connections, A and B, where data coming from A is only consumed once something is received on B. Data through A and B may be routed through the same TLS connection, with packets for A at the front and the packet for B at the very back. Cloak may fill up A's receiving buffer, causing recvDataFromRemote to block before the packet for B is processed. This is a deadlock: A's receiving buffer can only be cleared once the app sees data on B, but the app can't see data on B until A's receiving buffer is cleared.

Simply removing the faulty optimisation should fix the issue.

@LindaFerum if you're happy to, could you let me know if the problem is fixed for you? I'll make a new release if you don't see any more problems.

Pre-release with the fix here: https://github.com/cbeuw/Cloak/releases/tag/v2.7.1-pre

LindaFerum commented on June 20, 2024

@cbeuw awesome news, thanks! I'll try to test it out around next week or maybe even on weekend if I get lucky

notsure2 commented on June 20, 2024

I think I'm still seeing this issue but in UDP. Will try to figure it out.

notsure2 commented on June 20, 2024

I'm still seeing this issue on TCP also, but it's more rare than before. I think this is not completely solved yet.

LindaFerum commented on June 20, 2024

@cbeuw @notsure2 I can confirm it's still happening on TCP, albeit reproducing it became trickier (not sure what a good process would be; "lots of traffic and wait" seems to work, the waiting just got longer).

LindaFerum commented on June 20, 2024

Any news on this one? (No rush, just checking back :) )

notsure2 commented on June 20, 2024

LindaFerum commented on June 20, 2024

Well, @notsure2, I guess it does feel like it.
Observation: when redirecting traffic to Cloak via shadowsocks, I get stalls far more easily when I use encryption:none on the shadowsocks side (shadowsocks-rust allows that out of the box).
Switching to aes-128-gcm or whatever reduces the number of Cloak hiccups.

It would appear that shadowsocks spending extra cycles encrypting/decrypting introduces additional latency that makes it harder to induce a scheduling fault somewhere in a goroutine. Meanwhile, with encryption:none and high load, shadowsocks fires a veritable firehose of packets, and that does something bad scheduling-wise on the Cloak/goroutine side.

Makes sense?

notsure2 commented on June 20, 2024

LindaFerum commented on June 20, 2024

Maybe, given that the thing still happens, the issue should be reopened (for proper record-keeping and tracking and such)? @cbeuw

cbeuw commented on June 20, 2024

Reopened. I can still reproduce this issue locally if I fully load a large number of streams down a small number of TLS connections. I suspect it has something to do with goroutine scheduling when a TCP connection's buffer is full.

LindaFerum commented on June 20, 2024

Hey @cbeuw @notsure2, sorry to be a useless nagging needy user :-) but any progress on this one?
I've had some success working around it by non-Cloak means (basically, moving some of the load to an inferior anti-detection mechanism and also leveraging the fact that Amazon EC2 can shed IPv4 addresses like a lizard sheds its tail, lol, so when a server gets detected it's a stop-start plus a DNS update), but every once in a while this bug starts kicking my ass.

notsure2 commented on June 20, 2024

@cbeuw maybe you need to take the scheduling into your own hands... e.g. have an infinite-loop goroutine and queue functions onto it to control the order of execution.

cbeuw commented on June 20, 2024

maybe you need to take the scheduling into your own hands

The blocking happens inside IO calls to net.TCPConn. Only the Go runtime can schedule goroutines that are blocked on IO, so I can't really do much about it.

Though @LindaFerum, I did figure out that this is part of the problem:

case fixedConnMapping:
	// FIXME: this strategy has a tendency to cause a TLS conn socket buffer to fill up,
	// which is a problem when multiple streams are mapped to the same conn, resulting
	// in all such streams being blocked.
	conn = *assignedConn
	if conn == nil {
		conn, err = sb.pickRandConn()
		if err != nil {
			sb.session.SetTerminalMsg("failed to pick a connection " + err.Error())
			sb.session.passiveClose()
			return 0, err
		}
		*assignedConn = conn
	}
	n, err = conn.Write(data)
	if err != nil {
		sb.session.SetTerminalMsg("failed to send to remote " + err.Error())
		sb.session.passiveClose()
		return n, err
	}

I made an update to always spread the load out evenly across all TLS conns instead. From local testing it appears more resilient. I could still break the connection (with an error, not silently) if I open, say, 100 connections at once, though this should be unlikely in real scenarios. This is released in https://github.com/cbeuw/Cloak/releases/tag/v2.9.0
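
For illustration, the alternative strategy reads roughly like this (a hypothetical sketch, not the actual switchboard code): each send picks the next TLS connection in turn instead of pinning a stream to one connection, so one connection's full socket buffer no longer blocks every stream assigned to it.

package sketch

import (
	"net"
	"sync/atomic"
)

// connPool spreads writes across all TLS connections round-robin.
type connPool struct {
	conns []net.Conn
	next  uint64
}

func (p *connPool) send(data []byte) (int, error) {
	i := atomic.AddUint64(&p.next, 1)
	return p.conns[i%uint64(len(p.conns))].Write(data)
}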
