Discussion:
"100% CPU usage during SFTP transfer" bugfix
Vourhey
2009-08-18 08:46:52 UTC
Permalink
Hi all,

I use lubcurl 7.19.5 with libssh2-1.2 on Linux. I can see 100% CPU
usage when I try to upload any file by SFTP. This bug is already
described on the tracker:
http://sourceforge.net/tracker/?func=detail&atid=100976&aid=2816745&group_id=976
This bug is due to using non-blocking mode enabled by default.
Infinite loop (invoking of clock_gettime, poll, recv functions) loads
CPU.
I found the following in the libcurl code:
In file lib/ssh.c, function ssh_statemach_act:

    /* Set libssh2 to non-blocking, since cURL is all non-blocking */
    libssh2_session_set_blocking(sshc->ssh_session, 0);

"libssh2_session_set_blocking" function enables/disables non-blocking
mode. So, I've change this code to enable blocking mode for the
libssh2:

    libssh2_session_set_blocking(sshc->ssh_session, 1);

After libcurl recompiling, file uploading by SFTP loads CPU to ~9%.
And looks like everything is ok: downloading/uploading files by SFTP
work fine.

Could you please tell me, can these changes cause any issues in
libcurl work? Are there any plans to submit fix for this bug in the
next version of libcurl? Can the fix described above be included?

Thanks,
Pavel
Daniel Stenberg
2009-08-18 09:05:09 UTC
Permalink
I use lubcurl 7.19.5 with libssh2-1.2 on Linux. I can see 100% CPU usage
when I try to upload any file by SFTP. This bug is already described on the
http://sourceforge.net/tracker/?func=detail&atid=100976&aid=2816745&group_id=976
Is it really? That bug was unclearly described and the reporter never came
back to help us sort it out.

For example, he was using an older libssh2 than you. Are you also on Windows?
This bug is due to using non-blocking mode enabled by default.
No, that's not the reason. libcurl should use libssh2 non-blockingly, so the
bug may be related to that fact but it's not the reason itself.

It does give me the feeling it might be a libssh2 bug, but without more
details we can't tell for sure.
Infinite loop (invoking of clock_gettime, poll, recv functions) loads
CPU.
But why does it loop? When the EAGAIN situation is returned from libssh2,
libcurl checks for what to wait for socket-wise and that shouldn't happen
immediately. Is libssh2 telling us to wait for the wrong direction in this
situation? Can you dig up more specific details on what exactly is happening?
    libssh2_session_set_blocking(sshc->ssh_session, 1);
After libcurl recompiling, file uploading by SFTP loads CPU to ~9%. And
looks like everything is ok: downloading/uploading files by SFTP work fine.
That's weird. libcurl is supposed to do that blocking magic on its own using
the same mechanism that libssh2 uses internally when blocking is selected. It
would then rather indicate that libssh2 _is_ right and that libcurl gets
confused for some reason.
Could you please tell me, can these changes cause any issues in libcurl
work?
Yes. It'll make libcurl work less like it is supposed to, as libcurl wants to
work non-blocking with the transfer-layer. It'll be most evident if you use
the multi interface.
Are there any plans to submit fix for this bug in the next version of
libcurl?
I'd like us to fix all bugs, yes. But fixing bugs is a lot of hard work and
we're all busy people. I hope we can help you track this down when you tell us
more details from what you experience.
Can the fix described above be included?
No, that's not a fix - it's a work-around that you can use in your local build
but it's not suitable for the mainline code.
--
/ daniel.haxx.se
Vourhey
2009-08-18 11:49:23 UTC
Permalink
Post by Daniel Stenberg
For example, he was using an older libssh2 than you. Are you also on Windows?
I use Fedora Linux and RHEL. It is reproducible issue.
Post by Daniel Stenberg
It does give me the feeling it might be a libssh2 bug, but without more
details we can't tell for sure.
Post by Vourhey
Infinite loop (invoking of clock_gettime, poll, recv functions) loads
CPU.
But why does it loop? When the EAGAIN situation is returned from
libssh2, libcurl checks for what to wait for socket-wise and that
shouldn't happen immediately. Is libssh2 telling us to wait for the
wrong direction in this situation? Can you dig up more specific details
on what exactly is happening?
Looks like it is not a libssh2 bug.

I've compiled 7.19.5 version of curl with the latest version of libssh2
(1.2). The test is simple (to upload a file to a remote host by SFTP):
src/curl -T ./big.iso -u anyuser:anypass sftp://my_host.com/~/
after that curl sends .iso file to the host. And gets 99% of CPU.

libssh2 opens a socket for sending the file:
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
file/socket descriptor is 4.

After that:
send(4,
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //sending data from the file
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //there is no socket data
clock_gettime(CLOCK_MONOTONIC, {14965, 269959641}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270097791}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270215983}) = 0
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //check again, but still nothing
clock_gettime(CLOCK_MONOTONIC, {14965, 270749652}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270877781}) = 0
clock_gettime(CLOCK_MONOTONIC, {14965, 270972324}) = 0
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
unavailable) //check again, nothing
clock_gettime(CLOCK_MONOTONIC, {14965, 271424763}) = 0
. . .
and many calls of "clock_gettime" and "poll" functions until:
recv(4, "\207<{\317@\t\256\36|\34\33\24P{\0A\212q
\346\34_\323\270V\322\225\222-\213\243\237\237"..., 16384, MSG_NOSIGNAL) =
68 //the data is received from the socket

Then we read the file and send the next portion of data:
read(3, "\1CD001\1\0LINUX "..., 16384) = 16384 //read
file data
send(4,
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //send
invoke clock_gettime, poll & recv again and again until the data is received
from the socket.

Looks like these functions: clock_gettime, poll & recv are invoked in
infinite loop, because the socket is non-blocking and “recv” syscall does
not wait for data from the socket but exits immediately.
Here is the result of the test "Who does use the most of CPU time":
% time calls errors syscall
------ ----------- ----------- --------- ---------
39.82 70527 clock_gettime
34.33 46339 poll
20.92 23251 23199 recv
1.10 38 send

clock_gettime, poll are invoked very often.
We can see the following backtraces:
clock_gettime is invoked:
#0 0x00a2ecf4 in clock_gettime () from /lib/librt.so.1
#1 0x0074cf69 in curlx_tvnow () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#2 0x00777842 in Curl_socket_ready () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x0076e8cf in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0076f58d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()

clock_gettime is invoked:
#0 0x00a2ecc6 in clock_gettime () from /lib/librt.so.1
#1 0x002f6f69 in curlx_tvnow () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#2 0x003176bd in Curl_readwrite () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x003188ea in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0031958d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()

poll is invoked:
#0 0x00e25402 in __kernel_vsyscall ()
#1 0x003fc93b in poll () from /lib/libc.so.6
#2 0x003218ce in Curl_socket_ready () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#3 0x003188cf in Curl_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#4 0x0031958d in curl_easy_perform () from
/home/nortel/curl-7.19.5/lib/.libs/libcurl.so.4
#5 0x0804fc74 in main ()

So, poll and clock_gettime are invoked from the libcurl.
Looks like these backtraces are performed by the following code in infinite
loop (lib/transfer.c file, Transfer() function):
switch (Curl_socket_ready(fd_read, fd_write, timeout_ms)) {
case -1: /* select() error, stop reading */
#ifdef EINTR
/* The EINTR is not serious, and it seems you might get this more
often when using the lib in a multi-threaded environment! */
if(SOCKERRNO == EINTR)
continue;
#endif
return CURLE_RECV_ERROR; /* indicate a network problem */
case 0: /* timeout */
default: /* readable descriptors */

result = Curl_readwrite(conn, &done);
/* "done" signals to us if the transfer(s) are ready */
break;
}
Post by Daniel Stenberg
libcurl is supposed to do that blocking magic on its own
using the same mechanism that libssh2 uses internally when blocking is
selected. It would then rather indicate that libssh2 _is_ right and that
libcurl gets confused for some reason.
Could you please provide info, how to enable this mechanism on libcurl
level? "--disable-nonblock" option for the configure does not help and curl
still loads CPU to 100%.
Post by Daniel Stenberg
Post by Vourhey
Could you please tell me, can these changes cause any issues in
libcurl work?
Yes. It'll make libcurl work less like it is supposed to, as libcurl
wants to work non-blocking with the transfer-layer. It'll be most
evident if you use the multi interface.
Blocking mode in the libssh2 is performed by non-blocked socket and the
select syscall. So, libcurl just waits for return from a libssh2 call
("libssh2_sftp_write", probably). After that it will invoke the same code.
Could you please tell me, what kind of issues can this cause?
Post by Daniel Stenberg
I'd like us to fix all bugs, yes. But fixing bugs is a lot of hard work
and we're all busy people. I hope we can help you track this down when
you tell us more details from what you experience.
I understand. You're doing very good job. Thanks a lot.
Please, see details above and feel free to comment.

Thanks,
Pavel
Daniel Stenberg
2009-08-18 12:12:41 UTC
Permalink
Post by Vourhey
send(4,
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //sending data from the file
That's libssh2 doing the send()
Post by Vourhey
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
And libssh2 doing a recv(), since SFTP is a lot of back-and-forth packets even
while uploading
Post by Vourhey
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
This is the error. Even though it was recv() that returned EAGAIN and we
should wait until we can receive more data, this poll() waits for POLLOUT
which instead waits for the socket to be writeable. And it already is
writeable so it returns immediately...
Post by Vourhey
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
... but still not readable, so libssh2 returns EAGAIN at once again! :-/

(and on and on and on...)
Post by Vourhey
So, poll and clock_gettime are invoked from the libcurl.
Of course. But libssh2 tells libcurl what "direction" to wait for traffic, and
based on that info libcurl will poll() accordingly. Somehow this logic fails
and waits for the wrong action.

Do you by any chance have more than one libssh2 installation in your system?
I mean, when libcurl is built perhaps it thinks you have an older libssh2 and
thus doesn't use the correct direction-checking function (which was added in a
rather recent libssh2 version, I can't recall exactly which atm).
Post by Vourhey
libcurl is supposed to do that blocking magic on its own using the same
mechanism that libssh2 uses internally when blocking is selected. It would
then rather indicate that libssh2 _is_ right and that libcurl gets confused
for some reason.
Could you please provide info, how to enable this mechanism on libcurl
level?
It is already there and is used. What you see is a bug in that mechanism.
Post by Vourhey
Blocking mode in the libssh2 is performed by non-blocked socket and the
select syscall.
Yes, libssh2 always has the actual socket non-blocking even when it provides a
blocking API.
Post by Vourhey
So, libcurl just waits for return from a libssh2 call ("libssh2_sftp_write",
probably). After that it will invoke the same code. Could you please tell
me, what kind of issues can this cause?
It blocks libcurl until the entire call is complete, thus it prevents libcurl
from doing what it would normally do when the call is done in a non-blocking
manner. As I said, everything in libcurl works on the assumption that the
transport layer is non-blocking. For the exact details, just read the code or
test what happens.
--
/ daniel.haxx.se
Vourhey
2009-08-18 13:23:22 UTC
Permalink
Post by Vourhey
send(4,
Post by Vourhey
"D\320kE\33\10\253:\241~O\261\2407l\346\313\2275^\313X\305\303RY\264Y\201foH|"...,
16452, MSG_NOSIGNAL) = 16452 //sending data from the file
That's libssh2 doing the send()
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
And libssh2 doing a recv(), since SFTP is a lot of back-and-forth packets
even while uploading
poll([{fd=4, events=POLLOUT}], 1, 1000) = 1 ([{fd=4, revents=POLLOUT}])
This is the error. Even though it was recv() that returned EAGAIN and we
should wait until we can receive more data, this poll() waits for POLLOUT
which instead waits for the socket to be writeable. And it already is
writeable so it returns immediately...
recv(4, 0x827c7e4, 16384, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily
... but still not readable, so libssh2 returns EAGAIN at once again! :-/
(and on and on and on...)
So, poll and clock_gettime are invoked from the libcurl.
Of course. But libssh2 tells libcurl what "direction" to wait for traffic,
and based on that info libcurl will poll() accordingly. Somehow this logic
fails and waits for the wrong action.
Do you by any chance have more than one libssh2 installation in your
system? I mean, when libcurl is built perhaps it thinks you have an older
libssh2 and thus doesn't use the correct direction-checking function (which
was added in a rather recent libssh2 version, I can't recall exactly which
atm).
libcurl is supposed to do that blocking magic on its own using the same
Post by Vourhey
mechanism that libssh2 uses internally when blocking is selected. It would
then rather indicate that libssh2 _is_ right and that libcurl gets confused
for some reason.
Could you please provide info, how to enable this mechanism on libcurl
level?
It is already there and is used. What you see is a bug in that mechanism.
Blocking mode in the libssh2 is performed by non-blocked socket and the
Post by Vourhey
select syscall.
Yes, libssh2 always has the actual socket non-blocking even when it
provides a blocking API.
So, libcurl just waits for return from a libssh2 call
Post by Vourhey
("libssh2_sftp_write", probably). After that it will invoke the same code.
Could you please tell me, what kind of issues can this cause?
It blocks libcurl until the entire call is complete, thus it prevents
libcurl from doing what it would normally do when the call is done in a
non-blocking manner. As I said, everything in libcurl works on the
assumption that the transport layer is non-blocking. For the exact details,
just read the code or test what happens.
--
/ daniel.haxx.se
Thanks Daniel,
now I completely understand.

I use static linking to the libssh2. Old libssh2 can not cause any issues.
Daniel Stenberg
2009-08-18 15:55:04 UTC
Permalink
Post by Vourhey
I use static linking to the libssh2. Old libssh2 can not cause any issues.
Yes it could, if it would base the configure tests on an older version. But
Kamil said he can repeat it too so I think we can rule out the "old lib used
for test" as a probable cause.
--
/ daniel.haxx.se
Kamil Dudka
2009-08-20 13:55:12 UTC
Permalink
Hi Daniel,
Post by Daniel Stenberg
Post by Vourhey
I use static linking to the libssh2. Old libssh2 can not cause any issues.
Yes it could, if it would base the configure tests on an older version. But
Kamil said he can repeat it too so I think we can rule out the "old lib
used for test" as a probable cause.
I don't think it's a libssh2 bug. I've debugged it for a while and it seems
libssh2 is returning correct transfer direction info. However curl is not
willing to use the value.

libcurl saves the direction properly to conn->proto.sshc.waitfor, but the
variable is not accessed during transfer at all. The common layer (transfer.c)
then expects an outgoing transfer during upload while libssh2 is waiting for
ACK (incoming transfer).

The attached patch fixes the problem, but I am not going to apply it as it's
more likely a workaround than a fix. I think it should be somehow rewritten
to pass the direction info to functions from transfer.c to perform the poll.
Or am I completely wrong?

Note that it doesn't loop with SCP upload, but only with SFTP upload for me.

Kamil
Daniel Stenberg
2009-08-20 22:25:54 UTC
Permalink
Post by Kamil Dudka
I don't think it's a libssh2 bug. I've debugged it for a while and it seems
libssh2 is returning correct transfer direction info. However curl is not
willing to use the value.
libcurl saves the direction properly to conn->proto.sshc.waitfor, but the
variable is not accessed during transfer at all. The common layer
(transfer.c) then expects an outgoing transfer during upload while libssh2
is waiting for ACK (incoming transfer).
Really? What about ssh_block2waitfor() and ssh_perform_getsock() in lib/ssh.c?

ssh_block2waitfor() is supposed to figure out the reason for a "block"
(EAGAIN) and store that as a waitfor bitmask, and then ssh_perform_getsock()
uses that for the multi interface.

For the easy interface in ssh_easy_statemach(), libcurl uses
libssh2_session_block_directions() directly to figure out what to wait for.
Quite possibly that should rather use the 'waitfor' variable instead. Can you
compare the results from that libssh2_session_block_directions() call with the
contents of the waitfor variable to see if you can detect deviances during
transfer?
Post by Kamil Dudka
The attached patch fixes the problem, but I am not going to apply it as it's
more likely a workaround than a fix.
Right, since it will make the multi interface less non-blocking it isn't a
suitable.
Post by Kamil Dudka
Note that it doesn't loop with SCP upload, but only with SFTP upload for me.
Not too surprising really, SFTP is doing a lot more back-and-forth sending of
packets during a transfer than SCP so SFTP is more likely to end up in a
situation where it blocks in the "reverse" direction (like you're downloading
but get EAGAIN on the send() or vice versa).
--
/ daniel.haxx.se
Kamil Dudka
2009-08-21 07:14:40 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
I don't think it's a libssh2 bug. I've debugged it for a while and it
seems libssh2 is returning correct transfer direction info. However curl
is not willing to use the value.
libcurl saves the direction properly to conn->proto.sshc.waitfor, but the
variable is not accessed during transfer at all. The common layer
(transfer.c) then expects an outgoing transfer during upload while
libssh2 is waiting for ACK (incoming transfer).
Really? What about ssh_block2waitfor() and ssh_perform_getsock() in lib/ssh.c?
ssh_block2waitfor() is supposed to figure out the reason for a "block"
As far as I can tell this works properly.
Post by Daniel Stenberg
(EAGAIN) and store that as a waitfor bitmask, and then
ssh_perform_getsock() uses that for the multi interface.
I spotted this, though didn't test with the multi interface.
Post by Daniel Stenberg
For the easy interface in ssh_easy_statemach(), libcurl uses
This is not the place where libcurl is looping. ssh_easy_statemach() does not
run during the transfer at all. It is called only two times before the upload
is initiated. That's why I placed the workaround to Curl_sftp_send.

The looping situation looks following:

#1 sftp_packet_read (sftp=0x62b880) at sftp.c:176
#2 sftp_packet_require (sftp=0x62b880, packet_type=101 'e', request_id=2,
data=0x7fffffffd8a8, data_len=0x7fffffffd8b0)
at sftp.c:310
#3 sftp_write (handle=0x62ca70, buffer=0x61a269 "\353H\220\20\216м",
count=16384) at sftp.c:1486
#4 libssh2_sftp_write (hnd=0x62ca70, buffer=0x61a269 "\353H\220\20\216м",
count=16384) at sftp.c:1521
#5 Curl_sftp_send (conn=0x6202c0, sockindex=0, mem=0x61a269, len=16384) at
ssh.c:2753
#6 Curl_write (conn=0x6202c0, sockfd=10, mem=0x61a269, len=16384,
written=0x7fffffffda18) at sendf.c:287
#7 readwrite_upload (data=0x615ab0, conn=0x6202c0, k=0x615ad8,
didwhat=0x7fffffffda98) at transfer.c:1589
#8 Curl_readwrite (conn=0x6202c0, done=0x7fffffffdaff) at transfer.c:1692
#9 Transfer (conn=0x6202c0) at transfer.c:1953
#10 Curl_perform (data=0x615ab0) at transfer.c:2575
#11 curl_easy_perform (curl=0x615ab0) at easy.c:557

The command was:
curl -T ~/vm/rawhide.img -u xdudka00 --verbose sftp://sorento/~/
Post by Daniel Stenberg
libssh2_session_block_directions() directly to figure out what to wait for.
Quite possibly that should rather use the 'waitfor' variable instead. Can
you compare the results from that libssh2_session_block_directions() call
with the contents of the waitfor variable to see if you can detect
deviances during transfer?
Sure, the content is the same and seems ok to me, but it differs from
conn->data->req.keepon when looping.

Kamil
Daniel Stenberg
2009-08-21 07:42:30 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
libssh2_session_block_directions() directly to figure out what to wait for.
Quite possibly that should rather use the 'waitfor' variable instead. Can
you compare the results from that libssh2_session_block_directions() call
with the contents of the waitfor variable to see if you can detect
deviances during transfer?
Sure, the content is the same and seems ok to me, but it differs from
conn->data->req.keepon when looping.
Oh. Ouch. Hmmm. Grrr. Uuuuuh.

Yeah, I can see exactly what you're saying and of course this doesn't work as
intended. We need to make sure that the (ssh-) block-bits can be used by the
Curl_readwrite() function during the transfer phase as the 'keepon' direction
is too simple for this purpose. And we need to make it somewhat protocol
indepdent so that we don't just make a ssh kludge in the generic transfer.c
code.

I'll need to go back to staring at that code for a while more before I take a
stab at this. If anyone else does it first, by all means go ahead!
--
/ daniel.haxx.se
Kamil Dudka
2009-08-21 09:48:31 UTC
Permalink
Post by Daniel Stenberg
Yeah, I can see exactly what you're saying and of course this doesn't work
as intended. We need to make sure that the (ssh-) block-bits can be used by
the Curl_readwrite() function during the transfer phase as the 'keepon'
direction is too simple for this purpose. And we need to make it somewhat
protocol indepdent so that we don't just make a ssh kludge in the generic
transfer.c code.
I'll need to go back to staring at that code for a while more before I take
a stab at this. If anyone else does it first, by all means go ahead!
I am still a bit new to libcurl and not yet familiar with its design. What
about not making 'waitfor' as ssh-specific? Then its value can be checked
within Curl_readwrite() and performed the appropriate poll if necessary, not
sure how is the situation with other protocols...

Kamil
Daniel Stenberg
2009-08-21 10:56:13 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
I'll need to go back to staring at that code for a while more before I take
a stab at this. If anyone else does it first, by all means go ahead!
I am still a bit new to libcurl and not yet familiar with its design. What
about not making 'waitfor' as ssh-specific? Then its value can be checked
within Curl_readwrite() and performed the appropriate poll if necessary, not
sure how is the situation with other protocols...
Yes, that's along my line of thinking. We must of course make sure that we
only use this expanded 'waitfor' functionality for the protocols that actually
support it.

It would also provide a foundation to fix the problem with OpenSSL code and
the returned WANT codes that we discussed the other day.
--
/ daniel.haxx.se
Daniel Stenberg
2009-08-21 12:16:30 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
I am still a bit new to libcurl and not yet familiar with its design. What
about not making 'waitfor' as ssh-specific? Then its value can be checked
within Curl_readwrite() and performed the appropriate poll if necessary,
not sure how is the situation with other protocols...
Yes, that's along my line of thinking. We must of course make sure that we
only use this expanded 'waitfor' functionality for the protocols that
actually support it.
Okay, here's a first rough patch that shows a little what I had in mind. It
should be improved and cleaned up, and I didn't even test it with SFTP or
anything - I just ran a few hundred test cases that seemed to all work fine
still.
--
/ daniel.haxx.se
Daniel Stenberg
2009-08-21 12:25:23 UTC
Permalink
Post by Daniel Stenberg
Okay, here's a first rough patch that shows a little what I had in mind. It
should be improved and cleaned up, and I didn't even test it with SFTP or
anything - I just ran a few hundred test cases that seemed to all work fine
still.
Did I say it is rough? :-)

Here's a minor update that should make pause and bandwidth limiting etc to
still work too.
--
/ daniel.haxx.se
Kamil Dudka
2009-08-21 13:09:08 UTC
Permalink
Post by Daniel Stenberg
Post by Daniel Stenberg
Okay, here's a first rough patch that shows a little what I had in mind.
It should be improved and cleaned up, and I didn't even test it with SFTP
or anything - I just ran a few hundred test cases that seemed to all work
fine still.
Did I say it is rough? :-)
Here's a minor update that should make pause and bandwidth limiting etc to
still work too.
Looks good to me, though I haven't tested the patch yet. In fact it's almost
exactly the same what I was going to prepare as patch :-) What about creating
the 'waitfor' member within 'struct connectdata'? Then we could remove it
completely from 'struct ssh_conn' and let ssh.c stuff keep the value at only
one place. I'll test the patch ASAP.

Kamil
Daniel Stenberg
2009-08-21 13:59:34 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
Here's a minor update that should make pause and bandwidth limiting etc to
still work too.
Looks good to me, though I haven't tested the patch yet. In fact it's almost
exactly the same what I was going to prepare as patch :-) What about
creating the 'waitfor' member within 'struct connectdata'? Then we could
remove it completely from 'struct ssh_conn' and let ssh.c stuff keep the
value at only one place.
Yeah, that'll be a lot cleaner. We should do that.
Post by Kamil Dudka
I'll test the patch ASAP.
Great!
--
/ daniel.haxx.se
Kamil Dudka
2009-08-24 13:04:57 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
Post by Daniel Stenberg
Here's a minor update that should make pause and bandwidth limiting etc
to still work too.
Looks good to me, though I haven't tested the patch yet. In fact it's
almost exactly the same what I was going to prepare as patch :-) What
about creating the 'waitfor' member within 'struct connectdata'? Then we
could remove it completely from 'struct ssh_conn' and let ssh.c stuff
keep the value at only one place.
Yeah, that'll be a lot cleaner. We should do that.
Post by Kamil Dudka
I'll test the patch ASAP.
Great!
Bad news - the patch does not work. It ends up in a tight loop, now
unfortunately without transferring any data. The functions readwrite_data()
and readwrite_upload() were not called at all.

I was playing with that today. There were several flaws. But generally,
passing the 'waitfor' mask to 'keepon' within Curl_readwrite() can't help
since the call of Curl_socket_ready() is also non-blocking.

Is Curl_readwrite() really the function which should block? I think we should
solve the problem elsewhere.

Kamil
Daniel Stenberg
2009-08-24 13:37:05 UTC
Permalink
Post by Kamil Dudka
Bad news - the patch does not work. It ends up in a tight loop, now
unfortunately without transferring any data. The functions readwrite_data()
and readwrite_upload() were not called at all.
Hm, ok. Thanks for testing. Clearly too simple of a change. We need to do
somewhat more intrusive changes to get this rolling fine.
Post by Kamil Dudka
Is Curl_readwrite() really the function which should block? I think we
should solve the problem elsewhere.
No, Curl_readwrite() should not block but it should only check for traffic in
the desired direction(s) so I think the change is at least in the right
direction.

The block should be done in lib/transfer.c:Transfer() so that will also need
similar magic where it should get the 'waitfor' bits instead of the 'keepon'
set if they are indeed set.

Curl_single_getsock() also needs to get updated in the same spirit so that the
multi interface get improved this way too.
--
/ daniel.haxx.se
Kamil Dudka
2009-08-24 14:46:22 UTC
Permalink
Post by Daniel Stenberg
The block should be done in lib/transfer.c:Transfer() so that will also
need similar magic where it should get the 'waitfor' bits instead of the
'keepon' set if they are indeed set.
Curl_single_getsock() also needs to get updated in the same spirit so that
the multi interface get improved this way too.
What about the attached patch? It solves the problem for me, not sure if that
way is libcurl is supposed to work...

Kamil
Daniel Stenberg
2009-08-27 22:10:45 UTC
Permalink
Post by Kamil Dudka
What about the attached patch? It solves the problem for me, not sure if
that way is libcurl is supposed to work...
Not bad, not bad at all. But I didn't like the #ifdefs and the second socket
handling for the ssh transfers could be made better. How about this updated
version (based on yours) ?
--
/ daniel.haxx.se
Kamil Dudka
2009-08-27 22:56:31 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
What about the attached patch? It solves the problem for me, not sure if
that way is libcurl is supposed to work...
Not bad, not bad at all. But I didn't like the #ifdefs and the second
socket handling for the ssh transfers could be made better. How about this
updated version (based on yours) ?
Thanks for being patient. I don't like #ifdefs, too. We should avoid it as
Post by Daniel Stenberg
diff -rup curl.k/lib/ssh.c curl.d/lib/ssh.c
--- curl.k/lib/ssh.c 2009-08-28 00:18:04.383590721 +0200
+++ curl.d/lib/ssh.c 2009-08-28 00:18:31.550591905 +0200
@@ -1500,7 +1500,7 @@ static CURLcode ssh_statemach_act(struct
Curl_pgrsSetUploadSize(data, data->set.infilesize);
}
/* upload data */
- result = Curl_setup_transfer(conn, -1, -1, FALSE, NULL,
+ result = Curl_setup_transfer(conn, FIRSTSOCKET, -1, FALSE, NULL,
FIRSTSOCKET, NULL);
if(result) {
This breaks SFTP upload completely:
...
* Initialized SSH public key authentication
* Authentication complete
* Closing connection #0
* Failure when receiving data from the peer
Post by Daniel Stenberg
diff -rup curl.k/lib/transfer.c curl.d/lib/transfer.c
--- curl.k/lib/transfer.c 2009-08-28 00:18:04.383590721 +0200
+++ curl.d/lib/transfer.c 2009-08-28 00:18:31.554611830 +0200
@@ -1652,10 +1652,6 @@ CURLcode Curl_readwrite(struct connectda
if((k->keepon & KEEP_RECVBITS) == KEEP_RECV) {
fd_read = conn->sockfd;
-#if defined(USE_LIBSSH2)
- if(conn->protocol & (PROT_SCP|PROT_SFTP))
- select_res |= CURL_CSELECT_IN;
-#endif /* USE_LIBSSH2 */
} else
fd_read = CURL_SOCKET_BAD;
This breaks SFTP *download*. Anyway it deserves a comment:

1) Why has been the exception here?

2) Why are you going to drop it right now?

Kamil
Daniel Stenberg
2009-08-28 21:52:44 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
- result = Curl_setup_transfer(conn, -1, -1, FALSE, NULL,
+ result = Curl_setup_transfer(conn, FIRSTSOCKET, -1, FALSE, NULL,
FIRSTSOCKET, NULL);
I guess I need to stop blindly editing code and suggesting changes without
testing myself first....

The idea of mine is that both upload and download of SSH-based protocols need
both read and write sockets set since both directions will be needed.
Post by Kamil Dudka
Post by Daniel Stenberg
-#if defined(USE_LIBSSH2)
- if(conn->protocol & (PROT_SCP|PROT_SFTP))
- select_res |= CURL_CSELECT_IN;
-#endif /* USE_LIBSSH2 */
} else
fd_read = CURL_SOCKET_BAD;
1) Why has been the exception here?
I don't remember, but I experience pain when I see it. It seems like a very
weird assumption and action. I would think our waitfor bits approach should be
a much better way to achieve something similar.
Post by Kamil Dudka
2) Why are you going to drop it right now?
Simply because that old #ifdef is related to the same problem we're working
on, and with a "proper" fix going on I think it is also motivated to get rid
of old dirty kludges.
--
/ daniel.haxx.se
Kamil Dudka
2009-08-30 23:07:47 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
1) Why has been the exception here?
I don't remember, but I experience pain when I see it. It seems like a very
weird assumption and action. I would think our waitfor bits approach should
be a much better way to achieve something similar.
Post by Kamil Dudka
2) Why are you going to drop it right now?
Simply because that old #ifdef is related to the same problem we're working
on, and with a "proper" fix going on I think it is also motivated to get
rid of old dirty kludges.
Attached is my current version of the patch. There are no #ifdefs and it seems
to work. No extensive testing yet, just a simple sftp/scp download/upload
with the curl tool. I haven't actually had enough time to dive into code and
discover details causing this to work. I am only sharing my up-to-now results
to avoid duplicated work on the issue.

Kamil
Daniel Stenberg
2009-09-02 22:03:33 UTC
Permalink
Post by Kamil Dudka
Attached is my current version of the patch. There are no #ifdefs and it
seems to work. No extensive testing yet, just a simple sftp/scp
download/upload with the curl tool.
I like how this is turning, but unfortunately test 603 fails for me every time
now with this applied while it runs fine without it...

I'll try to investigate a bit further soonish why it does this.

(using libssh2 from an up-to-date git repo)
--
/ daniel.haxx.se
Kamil Dudka
2009-09-03 12:56:54 UTC
Permalink
Post by Daniel Stenberg
I like how this is turning, but unfortunately test 603 fails for me every
time now with this applied while it runs fine without it...
I'll try to investigate a bit further soonish why it does this.
(using libssh2 from an up-to-date git repo)
I would be happy to play with the test suite instead, but these tests are
always skipped for me. Any idea why?

Attached is output of:
$ ./runtests.pl 603 > test603.out

It's weird because I am able to start the server manually:
$ ./sshserver.pl -v
ssh server found /usr/sbin/sshd is OpenSSH 5.2.0
sftp server plugin found /usr/libexec/openssh/sftp-server
sftp client found /usr/bin/sftp
ssh keygen found /usr/bin/ssh-keygen
ssh client found /usr/bin/ssh is OpenSSH 5.2.0
generating host keys...
generating client keys...
generating ssh server config file...
generating ssh client known hosts file...
generating ssh client config file...
generating sftp client config file...
generating sftp client commands file...
SCP/SFTP server listening on port 8999

$ telnet 127.0.0.1 8999
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SSH-2.0-OpenSSH_5.2

No chance for me to debug those scripts since I've never read/written a byte
in perl :-(

Kamil
Kamil Dudka
2009-09-03 13:31:25 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
I like how this is turning, but unfortunately test 603 fails for me every
time now with this applied while it runs fine without it...
I'll try to investigate a bit further soonish why it does this.
(using libssh2 from an up-to-date git repo)
I would be happy to play with the test suite instead, but these tests are
always skipped for me. Any idea why?
$ ./runtests.pl 603 > test603.out
Looking at the output once again I can see it's related to SELinux. The test
passes for me with the previous patch (with #ifdefs) and SELinux set to
permissive mode.

Kamil
Kamil Dudka
2009-09-04 13:10:02 UTC
Permalink
Post by Kamil Dudka
Post by Kamil Dudka
Post by Daniel Stenberg
I like how this is turning, but unfortunately test 603 fails for me
every time now with this applied while it runs fine without it...
I'll try to investigate a bit further soonish why it does this.
(using libssh2 from an up-to-date git repo)
I would be happy to play with the test suite instead, but these tests are
always skipped for me. Any idea why?
$ ./runtests.pl 603 > test603.out
Looking at the output once again I can see it's related to SELinux. The
test passes for me with the previous patch (with #ifdefs) and SELinux set
to permissive mode.
The SELinux problem has been resolved:
https://bugzilla.redhat.com/show_bug.cgi?id=521087

Now the test passes for me with both (old and current) versions of the patch.
Could you please attach the log of the failing test? Thanks in advance!

Kamil
Kamil Dudka
2009-09-16 14:34:04 UTC
Permalink
Post by Daniel Stenberg
Post by Kamil Dudka
Attached is my current version of the patch. There are no #ifdefs and it
seems to work. No extensive testing yet, just a simple sftp/scp
download/upload with the curl tool.
I like how this is turning, but unfortunately test 603 fails for me every
time now with this applied while it runs fine without it...
My fault, I was running the test suite against the libcurl installed on my
system instead of the just compiled one. That's why the test passed for me.
The bug is fixed in the attached patch. Here are my test results:
$ ./runtests.pl -k SCP SFTP

http://kdudka.fedorapeople.org/curl-poll-testlog.tar.lzma
Post by Daniel Stenberg
(using libssh2 from an up-to-date git repo)
Also tested with the latest libssh2...

Kamil
Kamil Dudka
2009-09-16 15:53:56 UTC
Permalink
Post by Kamil Dudka
Post by Daniel Stenberg
Post by Kamil Dudka
Attached is my current version of the patch. There are no #ifdefs and
it seems to work. No extensive testing yet, just a simple sftp/scp
download/upload with the curl tool.
I like how this is turning, but unfortunately test 603 fails for me every
time now with this applied while it runs fine without it...
My fault, I was running the test suite against the libcurl installed on my
system instead of the just compiled one. That's why the test passed for me.
$ ./runtests.pl -k SCP SFTP
http://kdudka.fedorapeople.org/curl-poll-testlog.tar.lzma
Post by Daniel Stenberg
(using libssh2 from an up-to-date git repo)
Also tested with the latest libssh2...
It still doesn't seem to be perfect. I've noticed several failing tests [1]
on koji, not yet able to reproduce the failures locally. I'll check tomorrow
what has happend.

Kamil

[1] http://koji.fedoraproject.org/koji/taskinfo?taskID=1683148
Kamil Dudka
2009-09-16 18:47:38 UTC
Permalink
Post by Kamil Dudka
It still doesn't seem to be perfect. I've noticed several failing tests [1]
on koji, not yet able to reproduce the failures locally. I'll check
tomorrow what has happend.
Nope, it was false alarm! Fedora 12 build root seems to be broken.

It fails occasionally, no matter if the polling patch is applied or not. But
the same SRPM runs properly with/without the patch on Fedora 11 build root.
Please ignore the previous mail.

Any feedback for the patch is welcome

Kamil

Kamil Dudka
2009-08-18 12:27:41 UTC
Permalink
Post by Vourhey
Post by Daniel Stenberg
For example, he was using an older libssh2 than you. Are you also on Windows?
I use Fedora Linux and RHEL. It is reproducible issue.
Thanks for the report! I can confirm the bug with the latest Fedora (lib)curl.
Willing to open a ticket at RHBZ to track this down? I'll give it a try tmrw
and hopefully come with some new info.

Kamil
Loading...