Commit Graph

195 Commits

Author SHA1 Message Date
Ilpo Järvinen
c96fd3d461 [TCP]: Enable SACK enhanced FRTO (RFC4138) by default
Most of the description that follows comes from my mail to
netdev (some editing done):

Main obstacle to FRTO use is its deployment as it has to be on
the sender side where as wireless link is often the receiver's
access link. Take initiative on behalf of unlucky receivers and
enable it by default in future Linux TCP senders. Also IETF
seems to interested in advancing FRTO from experimental [1].

How does FRTO help?
===================

FRTO detects spurious RTOs and avoids a number of unnecessary
retransmissions and a couple of other problems that can arise
due to incorrect guess made at RTO (i.e., that segments were
lost when they actually got delayed which is likely to occur
e.g. in wireless environments with link-layer retransmission).
Though FRTO cannot prevent the first (potentially unnecessary)
retransmission at RTO, I suspect that it won't cost that much
even if you have to pay for each bit (won't be that high
percentage out of all packets after all :-)). However, usually
when you have a spurious RTO, not only the first segment
unnecessarily retransmitted but the *whole window*. It goes like
this: all cumulative ACKs got delayed due to in-order delivery,
then TCP will actually send 1.5*original cwnd worth of data in
the RTO's slow-start when the delayed ACKs arrive (basically the
original cwnd worth of it unnecessarily). In case one is
interested in minimizing unnecessary retransmissions e.g. due to
cost, those rexmissions must never see daylight. Besides, in the
worst case the generated burst overloads the bottleneck buffers
which is likely to significantly delay the further progress of
the flow. In case of ll rexmissions, ACK compression often
occurs at the same time making the burst very "sharp edged" (in
that case TCP often loses most of the segments above high_seq
=> very bad performance too). When FRTO is enabled, those
unnecessary retransmissions are fully avoided except for the
first segment and the cwnd behavior after detected spurious RTO
is determined by the response (one can tune that by sysctl).

Basic version (non-SACK enhanced one), FRTO can fail to detect
spurious RTO as spurious and falls back to conservative
behavior. ACK lossage is much less significant than reordering,
usually the FRTO can detect spurious RTO if at least 2
cumulative ACKs from original window are preserved (excluding
the ACK that advances to high_seq). With SACK-enhanced version,
the detection is quite robust.

FRTO should remove the need to set a high lower bound for the
RTO estimator due to delay spikes that occur relatively common
in some environments (esp. in wireless/cellular ones).

[1] http://www1.ietf.org/mail-archive/web/tcpm/current/msg02862.html

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:12 -07:00
Ilpo Järvinen
009a2e3e4e [TCP] FRTO: Improve interoperability with other undo_marker users
Basically this change enables it, previously other undo_marker
users were left with nothing. Reverse undo_marker logic
completely to get it set right in CA_Loss. On the other hand,
when spurious RTO is detected, clear it. Clearing might be too
heavy for some scenarios but seems safe enough starting point
for now and shouldn't have much effect except in majority of
cases (if in any).

By adding a new FLAG_ we avoid looping through write_queue when
RTO occurs.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:11 -07:00
Ilpo Järvinen
7c46a03e67 [TCP]: Cleanup tcp_tso_acked and tcp_clean_rtx_queue
Implements following cleanups:
- Comment re-placement (CodingStyle)
- tcp_tso_acked() local (wrapper-like) variable removal
  (readability)
- __-types removed (IMHO they make local variables jumpy looking
  and just was space)
- acked -> flag (naming conventions elsewhere in TCP code)
- linebreak adjustments (readability)
- nested if()s combined (reduced indentation)
- clarifying newlines added

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:10 -07:00
Ilpo Järvinen
13fcf850cc [TCP]: Move accounting from tso_acked to clean_rtx_queue
The accounting code is pretty much the same, so it's a shame
we do it in two places.

I'm not too sure if added fully_acked check in MTU probing is
really what we want perhaps the added end_seq could be used in
the after() comparison.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:09 -07:00
Ilpo Järvinen
5af4ec236f [TCP]: clear_all_retrans_hints prefixed by tcp_
In addition, fix its function comment spacing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
2007-10-10 16:52:09 -07:00
Ilpo Järvinen
91fed7a15c [TCP]: Make fackets_out accurate
Substraction for fackets_out is unconditional when snd_una
advances, thus there's no need to do it inside the loop. Just
make sure correct bounds are honored.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:08 -07:00
Ilpo Järvinen
18f02545a9 [TCP] MIB: Add counters for discarded SACK blocks
In DSACK case, some events are not extraordinary, such as packet
duplication generated DSACK. They can arrive easily below
snd_una when undo_marker is not set (TCP being in CA_Open),
counting such DSACKs amoung SACK discards will likely just
mislead if they occur in some scenario when there are other
problems as well. Similarly, excessively delayed packets could
cause "normal" DSACKs. Therefore, separate counters are
allocated for DSACK events.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:30 -07:00
Ilpo Järvinen
5b3c98821a [TCP]: Discard fuzzy SACK blocks
SACK processing code has been a sort of russian roulette as no
validation of SACK blocks is previously attempted. Besides, it
is not very clear what all kinds of broken SACK blocks really
mean (e.g., one that has start and end sequence numbers
reversed). So now close the roulette once and for all.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:29 -07:00
Ilpo Järvinen
6728e7dc3e [TCP]: Rename tcp_ack_packets_out -> tcp_rearm_rto
Only thing that tiny function does is rearming the RTO (if
necessary), name it accordingly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:28 -07:00
Ilpo Järvinen
e9144bd8da [TCP]: Remove unnecessary wrapper tcp_packets_out_dec
Makes caller side more obvious, there's no need to have
a wrapper for this oneliner!

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:27 -07:00
Ilpo Järvinen
e60402d0a9 [TCP]: Move sack_ok access to obviously named funcs & cleanup
Previously code had IsReno/IsFack defined as macros that were
local to tcp_input.c though sack_ok field has user elsewhere too
for the same purpose. This changes them to static inlines as
preferred according the current coding style and unifies the
access to sack_ok across multiple files. Magic bitops of sack_ok
for FACK and DSACK are also abstracted to functions with
appropriate names.

Note:
- One sack_ok = 1 remains but that's self explanary, i.e., it
  enables sack
- Couple of !IsReno cases are changed to tcp_is_sack
- There were no users for IsDSack => I dropped it

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:00 -07:00
Ilpo Järvinen
1b6d427bb7 [TCP]: Reduce sacked_out with reno when purging write_queue
Previously TCP had a transitional state during which reno
counted segments that are already below the current window into
sacked_out, which is now prevented. In addition, re-try now
the unconditional S+L skb catching.

This approach conservatively calls just remove_sack and leaves
reset_sack() calls alone. The best solution to the whole problem
would be to first calculate the new sacked_out fully (this patch
does not move reno_sack_reset calls from original sites and thus
does not implement this). However, that would require very
invasive change to fastretrans_alert (perhaps even slicing it to
two halves). Alternatively, all callers of tcp_packets_in_flight
(i.e., users that depend on sacked_out) should be postponed
until the new sacked_out has been calculated but it isn't any
simpler alternative.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:58 -07:00
Ilpo Järvinen
d02596e329 [TCP]: Keep state in Disorder also if only lost_out > 0
This happens rather infrequently and is only possible during
FRTO. We must not allow TCP to slip to Open state because
tcp_fastretrans_alert might then not be called on it's time
when FRTO has exited. This become a problem when left_out
got removed and was replaced by just sacked_out.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:58 -07:00
Ilpo Järvinen
86426c22d2 [TCP]: Restore over-zealous tcp_sync_left_out-like removals
tcp_verify_left_out is useful for verifying S+L condition, so
add it back to couple of places in where the code was not
calling to tcp_sync_left_out but used own ad-hoc solution
(before the tcp_sync_left_out got removed).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:57 -07:00
Ilpo Järvinen
005903bc3a [TCP]: Left out sync->verify (the new meaning of it) & definify
Left_out was dropped a while ago, thus leaving verifying
consistency of the "left out" as only task for the function in
question. Thus make it's name more appropriate.

In addition, it is intentionally converted to #define instead
of static inline because the location of the invariant failure
is the most important thing to have if this ever triggers. I
think it would have been helpful e.g. in this case where the
location of the failure point had to be based on some quesswork:
    http://lkml.org/lkml/2007/5/2/464
...Luckily the guesswork seems to have proved to be correct.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:57 -07:00
Ilpo Järvinen
83ae40885f [TCP]: Add tcp_left_out(tp) "back" to get cleaner looking lines
tp->left_out got removed but nothing came to replace it back
then (users just did addition by themselves), so add function
for users now.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:56 -07:00
Ilpo Järvinen
b5860bbac7 [TCP]: Tighten tcp_sock's belt, drop left_out
It is easily calculable when needed and user are not that many
after all.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:55 -07:00
Ilpo Järvinen
bdf1ee5d3b [TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it
No other users exist for tcp_ecn.h. Very few things remain in
tcp.h, for most TCP ECN functions callers reside within a
single .c file and can be placed there.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:54 -07:00
Ilpo Järvinen
9bff40fda0 [TCP] FRTO: remove unnecessary fackets/sacked_out recounting
F-RTO does not touch SACKED_ACKED bits at all, so there is no
need to recount them in tcp_enter_frto_loss. After removal of
the else branch, nested ifs can be combined.

This must also reset sacked_out when SACK is not in use as TCP
could have received some duplicate ACKs prior RTO. To achieve
that in a sane manner, tcp_reset_reno_sack was re-placed by the
previous patch.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:53 -07:00
Ilpo Järvinen
4ddf66769d [TCP]: Move Reno SACKed_out counter functions earlier
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:52 -07:00
David S. Miller
d06e021d71 [TCP]: Extract DSACK detection code from tcp_sacktag_write_queue().
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:51 -07:00
Ilpo Järvinen
19b2b48658 [TCP]: Rexmit hint must be cleared instead of setting it
Stupid error from my side. Even though now that I noticed this,
I hoped it would have been an optimization but no, the counter
hint is then incorrect. Thus clearing is necessary for now (I
still suspect though that this path is never executed).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:51 -07:00
Ilpo Järvinen
d8f4f2235a [TCP]: Extracted rexmit hint clearing from the LOST marking code
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:50 -07:00
Ilpo Järvinen
d738cd8fca [TCP]: Add highest_sack seqno, points to globally highest SACK
It is guaranteed to be valid only when !tp->sacked_out. In most
cases this seqno is available in the last ACK but there is no
guarantee for that. The new fast recovery loss marking algorithm
needs this as entry point.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:47:50 -07:00
Ilpo Järvinen
48611c47d0 [TCP]: Fix fastpath_cnt_hint when GSO skb is partially ACKed
When only GSO skb was partially ACKed, no hints are reset,
therefore fastpath_cnt_hint must be tweaked too or else it can
corrupt fackets_out. The corruption to occur, one must have
non-trivial ACK/SACK sequence, so this bug is not very often
that harmful. There's a fackets_out state reset in TCP because
fackets_out is known to be inaccurate and that fixes the issue
eventually anyway.

In case there was also at least one skb that got fully ACKed,
the fastpath_skb_hint is set to NULL which causes a recount for
fastpath_cnt_hint (the old value won't be accessed anymore),
thus it can safely be decremented without additional checking.

Reported by Cedric Le Goater <clg@fr.ibm.com>

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-07 23:43:10 -07:00
David S. Miller
5c127c58ae [TCP]: 'dst' can be NULL in tcp_rto_min()
Reported by Rick Jones.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-31 14:39:44 -07:00
David S. Miller
05bb1fad1c [TCP]: Allow minimum RTO to be configurable via routing metrics.
Cell phone networks do link layer retransmissions and other
things that cause unnecessary timeout retransmits.  So allow
the minimum RTO to be inflated per-route to deal with this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-30 22:10:28 -07:00
David S. Miller
26722873a4 [TCP]: Describe tcp_init_cwnd() thoroughly in a comment.
People often get tripped up by this function and think that
it does not implemented the prescribed algorithms from
RFC2414 and RFC3390, even though it does.

So add a comment to head off such misunderstandings in the
future.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-26 18:35:36 -07:00
Ilpo Järvinen
49ff4bb4cd [TCP]: DSACK signals data receival, be conservative
In case a DSACK is received, it's better to lower cwnd as it's
a sign of data receival.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-02 19:47:59 -07:00
Ilpo Järvinen
2e6052941a [TCP]: Also handle snd_una changes in tcp_cwnd_down
tcp_cwnd_down must check for it too as it should be conservative
in case of collapse stuff and also when receiver is trying to
lie (though that wouldn't be very successful/useful anyway).

Note:
- Separated also is_dupack and do_lost in fast_retransalert
	* Much cleaner look-and-feel now
	* This time it really fixes cumulative ACK with many new
	  SACK blocks recovery entry (I claimed this fixes with
	  last patch but it wasn't). TCP will now call
	  tcp_update_scoreboard regardless of is_dupack when
	  in recovery as long as there is enough fackets_out.
- Introduce FLAG_SND_UNA_ADVANCED
	* Some prior_snd_una arguments are unnecessary after it
- Added helper FLAG_ANY_PROGRESS to avoid long FLAG...|FLAG...
  constructs

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-08-02 19:46:58 -07:00
Ilpo Järvinen
b8ed601cef [TCP]: Bidir flow must not disregard SACK blocks for lost marking
It's possible that new SACK blocks that should trigger new LOST
markings arrive with new data (which previously made is_dupack
false). In addition, I think this fixes a case where we get
a cumulative ACK with enough SACK blocks to trigger the fast
recovery (is_dupack would be false there too).

I'm not completely pleased with this solution because readability
of the code is somewhat questionable as 'is_dupack' in SACK case
is no longer about dupacks only but would mean something like
'lost_marker_work_todo' too... But because of Eifel stuff done
in CA_Recovery, the FLAG_DATA_SACKED check cannot be placed to
the if statement which seems attractive solution. Nevertheless,
I didn't like adding another variable just for that either... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-31 02:28:31 -07:00
Ilpo Järvinen
1e757f9996 [TCP]: Fix ratehalving with bidirectional flows
Actually, the ratehalving seems to work too well, as cwnd is
reduced on every second ACK even though the packets in flight
remains unchanged. Recoveries in a bidirectional flows suffer
quite badly because of this, both NewReno and SACK are affected.

After this patch, rate halving is performed for ACK only if
packets in flight was supposedly changed too.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-31 02:28:30 -07:00
Stephen Hemminger
30cfd0baf0 [TCP]: congestion control API pass RTT in microseconds
This patch changes the API for the callback that is done after an ACK is
received. It solves a couple of issues:

  * Some congestion controls want higher resolution value of RTT
    (controlled by TCP_CONG_RTT_SAMPLE flag). These don't really want a ktime, but
    all compute a RTT in microseconds.

  * Other congestion control could use RTT at jiffies resolution.

To keep API consistent the units should be the same for both cases, just the
resolution should change.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-31 02:27:57 -07:00
Stephen Hemminger
16751347a0 [TCP]: remove unused argument to cong_avoid op
None of the existing TCP congestion controls use the rtt value pased
in the ca_ops->cong_avoid interface.  Which is lucky because seq_rtt
could have been -1 when handling a duplicate ack.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18 01:46:58 -07:00
Ilpo Järvinen
0a9f2a467d [TCP]: Verify the presence of RETRANS bit when leaving FRTO
For yet unknown reason, something cleared SACKED_RETRANS bit
underneath FRTO.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-15 00:19:29 -07:00
Ilpo Järvinen
7769f4064c [TCP]: Fix logic breakage due to DSACK separation
Commit 6f74651ae6 is found guilty
of breaking DSACK counting, which should be done only for the
SACK block reported by the DSACK instead of every SACK block
that is received along with DSACK information.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-15 15:14:04 -07:00
Ilpo Järvinen
b9ce204f0a [TCP]: Congestion control API RTT sampling fix
Commit 164891aadf broke RTT
sampling of congestion control modules. Inaccurate timestamps
could be fed to them without providing any way for them to
identify such cases. Previously RTT sampler was called only if
FLAG_RETRANS_DATA_ACKED was not set filtering inaccurate
timestamps nicely. In addition, the new behavior could give an
invalid timestamp (zero) to RTT sampler if only skbs with
TCPCB_RETRANS were ACKed. This solves both problems.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-15 15:08:43 -07:00
Ilpo Järvinen
d7ea5b91fa [TCP]: Add missing break to TCP option parsing code
This flaw does not affect any behavior (currently).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-14 12:58:26 -07:00
Ilpo Järvinen
af15cc7b85 [TCP]: Fix left_out setting during FRTO
Without FRTO, the tcp_try_to_open is never called with
lost_out > 0 (see tcp_time_to_recover). However, when FRTO is
enabled, the !tp->lost condition is not used until end of FRTO
because that way TCP avoids premature entry to fast recovery
during FRTO.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-12 16:16:44 -07:00
Ilpo Järvinen
6418204f91 [TCP]: Fix GSO ignorance of pkts_acked arg (cong.cntrl modules)
The code used to ignore GSO completely, passing either way too
small or zero pkts_acked when GSO skb or part of it got ACKed.
In addition, there is no need to calculate the value in the loop
but simple arithmetics after the loop is sufficient. There is
no need to handle SYN case specially because congestion control
modules are not yet initialized when FLAG_SYN_ACKED is set.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-03 18:08:48 -07:00
Ilpo Järvinen
580e572a4a [TCP] FRTO: Prevent state inconsistency in corner cases
State could become inconsistent in two cases:

1) Userspace disabled FRTO by tuning sysctl when one of the TCP
   flows was in the middle of FRTO algorithm (and then RTO is
   again triggered)

2) SACK reneging occurs during FRTO algorithm

A simple solution is just to abort the previous FRTO when such
obscure condition occurs...

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-19 13:56:57 -07:00
Ilpo Järvinen
463236557d [TCP] FRTO: Add missing ECN CWR sending to one of the responses
The conservative spurious RTO response did not queue CWR even
though the sending rate was lowered. Whenever reduction happens
regardless of reason, CWR should be sent (forgetting to send it
is not very fatal though).

A better approach would be to queue CWR when one of the sending
rate reducing responses (rate-halving one or this conservative
response) is used already at RTO. Doing that would allow CWR to
be sent along with the two new data segments that are sent
during FRTO. However, it's a bit "racy" because userland could
tune the response sysctl to a more aggressive one in between.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-19 13:56:23 -07:00
Ilpo Järvinen
d551e4541d [TCP] FRTO: RFC4138 allows Nagle override when new data must be sent
This is a corner case where less than MSS sized new data thingie
is awaiting in the send queue. For F-RTO to work correctly, a
new data segment must be sent at certain point or F-RTO cannot
be used at all. RFC4138 allows overriding of Nagle at that
point.

Implementation uses frto_counter states 2 and 3 to distinguish
when Nagle override is needed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-30 00:58:16 -07:00
Ilpo Järvinen
575ee7140d [TCP] FRTO: Delay skb available check until it's mandatory
No new data is needed until the first ACK comes, so no need to check
for application limitedness until then.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-30 00:58:12 -07:00
Stephen Hemminger
164891aadf [TCP]: Congestion control API update.
Do some simple changes to make congestion control API faster/cleaner.
* use ktime_t rather than timeval
* merge rtt sampling into existing ack callback
  this means one indirect call versus two per ack.
* use flags bits to store options/settings

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:45 -07:00
Ilpo Järvinen
9e412ba763 [TCP]: Sed magic converts func(sk, tp, ...) -> func(sk, ...)
This is (mostly) automated change using magic:

sed -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N'
    -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N'
    -e 's|struct sock \*sk,[\n\t ]*struct tcp_sock \*tp\([^{]*\n{\n\)|
	  struct sock \*sk\1\tstruct tcp_sock *tp = tcp_sk(sk);\n|g'
    -e 's|struct sock \*sk, struct tcp_sock \*tp|
	  struct sock \*sk|g' -e 's|sk, tp\([^-]\)|sk\1|g'

Fixed four unused variable (tp) warnings that were introduced.

In addition, manually added newlines after local variables and
tweaked function arguments positioning.

$ gcc --version
gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)
...
$ codiff -fV built-in.o.old built-in.o.new
net/ipv4/route.c:
  rt_cache_flush |  +14
 1 function changed, 14 bytes added

net/ipv4/tcp.c:
  tcp_setsockopt |   -5
  tcp_sendpage   |  -25
  tcp_sendmsg    |  -16
 3 functions changed, 46 bytes removed

net/ipv4/tcp_input.c:
  tcp_try_undo_recovery |   +3
  tcp_try_undo_dsack    |   +2
  tcp_mark_head_lost    |  -12
  tcp_ack               |  -15
  tcp_event_data_recv   |  -32
  tcp_rcv_state_process |  -10
  tcp_rcv_established   |   +1
 7 functions changed, 6 bytes added, 69 bytes removed, diff: -63

net/ipv4/tcp_output.c:
  update_send_head          |   -9
  tcp_transmit_skb          |  +19
  tcp_cwnd_validate         |   +1
  tcp_write_wakeup          |  -17
  __tcp_push_pending_frames |  -25
  tcp_push_one              |   -8
  tcp_send_fin              |   -4
 7 functions changed, 20 bytes added, 63 bytes removed, diff: -43

built-in.o.new:
 18 functions changed, 40 bytes added, 178 bytes removed, diff: -138

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:34 -07:00
Stephen Hemminger
3ff50b7997 [NET]: cleanup extra semicolons
Spring cleaning time...

There seems to be a lot of places in the network code that have
extra bogus semicolons after conditionals.  Most commonly is a
bogus semicolon after: switch() { }

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:24 -07:00
Herbert Xu
604763722c [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY.  Therefore we should
treat it as such in the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:43 -07:00
Arnaldo Carvalho de Melo
9c70220b73 [SK_BUFF]: Introduce skb_transport_header(skb)
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo
aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo
967b05f64e [SK_BUFF]: Introduce skb_set_transport_header
For the cases where the transport header is being set to a offset from
skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:17 -07:00
Arnaldo Carvalho de Melo
c14d2450cb [SK_BUFF]: Introduce skb_set_network_header
For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:01 -07:00
Arnaldo Carvalho de Melo
d56f90a7c9 [SK_BUFF]: Introduce skb_network_header()
For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:59 -07:00
Arnaldo Carvalho de Melo
98e399f82a [SK_BUFF]: Introduce skb_mac_header()
For the places where we need a pointer to the mac header, it is still legal to
touch skb->mac.raw directly if just adding to, subtracting from or setting it
to another layer header.

This one also converts some more cases to skb_reset_mac_header() that my
regex missed as it had no spaces before nor after '=', ugh.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:41 -07:00
Arnaldo Carvalho de Melo
31713c333d [TCP]: Use skb_set_mac_header in tcp_collapse
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:39 -07:00
Arnaldo Carvalho de Melo
c51957dafa [TCP]: Do the layer header setting in tcp_collapse relative to skb->data
That is equal to skb->head before skb_reserve, to help in the layer header
changes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:38 -07:00
Stephen Hemminger
2de979bd7d [TCP]: whitespace cleanup
Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:13 -07:00
David S. Miller
fe067e8ab5 [TCP]: Abstract out all write queue operations.
This allows the write queue implementation to be changed,
for example, to one which allows fast interval searching.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:02 -07:00
James Morris
9d729f72dc [NET]: Convert xtime.tv_sec to get_seconds()
Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:32 -07:00
Ilpo Järvinen
e317f6f69c [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever
FLAG_ECE is set by clearing prior_ssthresh. The clearing does
not protect FRTO because FRTO operates before fastretrans_alert.
Moving the clearing of prior_ssthresh earlier seems to be a
suboptimal solution to the FRTO case because then FLAG_ECE will
cause a second ssthresh reduction in try_to_open (the first
occurred when FRTO was entered). So instead, FRTO falls back
immediately to the rate halving response, which switches TCP to
CA_CWR state preventing the latter reduction of ssthresh.

If the first ECE arrived before the ACK after which FRTO is able
to decide RTO as spurious, prior_ssthresh is already cleared.
Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be
set also in the following ACKs resulting in rate halving response
that sees TCP is already in CA_CWR, which again prevents an extra
ssthresh reduction on that round-trip.

If the first ECE arrived before RTO, ssthresh has already been
adapted and prior_ssthresh remains cleared on entry because TCP
is in CA_CWR (the same applies also to a case where FRTO is
entered more than once and ECE comes in the middle).

High_seq must not be touched after tcp_enter_cwr because CWR
round-trip calculation depends on it.

I believe that after this patch, FRTO should be ECN-safe and
even able to take advantage of synergy benefits.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:26 -07:00
Ilpo Järvinen
e01f9d7793 [TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr)
A local variable for icsk was created but this change was
missing. Spotted by Jarek Poplawski.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:25 -07:00
Ilpo Järvinen
3cfe3baaf0 [TCP]: Add two new spurious RTO responses to FRTO
New sysctl tcp_frto_response is added to select amongst these
responses:
	- Rate halving based; reuses CA_CWR state (default)
	- Very conservative; used to be the only one available (=1)
	- Undo cwr; undoes ssthresh and cwnd reductions (=2)

The response with rate halving requires a new parameter to
tcp_enter_cwr because FRTO has already reduced ssthresh and
doing a second reduction there has to be prevented. In addition,
to keep things nice on 80 cols screen, a local variable was
added.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:23 -07:00
Ilpo Järvinen
c5e7af0df5 [TCP]: Correct reordering detection change (no FRTO case)
The reordering detection must work also when FRTO has not been
used at all which was the original intention of mine, just the
expression of the idea was flawed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:22 -07:00
Ilpo Järvinen
4dc2665e36 [TCP]: SACK enhanced FRTO
Implements the SACK-enhanced FRTO given in RFC4138 using the
variant given in Appendix B.

RFC4138, Appendix B:
  "This means that in order to declare timeout spurious, the TCP
   sender must receive an acknowledgment for non-retransmitted
   segment between SND.UNA and RecoveryPoint in algorithm step 3.
   RecoveryPoint is defined in conservative SACK-recovery
   algorithm [RFC3517]"

The basic version of the FRTO algorithm can still be used also
when SACK is enabled. To enabled SACK-enhanced version, tcp_frto
sysctl is set to 2.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:16 -07:00
Ilpo Järvinen
288035f915 [TCP]: Prevent reordering adjustments during FRTO
To be honest, I'm not too sure how the reord stuff works in the
first place but this seems necessary.

When FRTO has been active, the one and only retransmission could
be unnecessary but the state and sending order might not be what
the sacktag code expects it to be (to work correctly).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:15 -07:00
Ilpo Järvinen
66e93e45c0 [TCP] FRTO: Fake cwnd for ssthresh callback
TCP without FRTO would be in Loss state with small cwnd. FRTO,
however, leaves cwnd (typically) to a larger value which causes
ssthresh to become too large in case RTO is triggered again
compared to what conventional recovery would do. Because
consecutive RTOs result in only a single ssthresh reduction,
RTO+cumulative ACK+RTO pattern is required to trigger this
event.

A large comment is included for congestion control module writers
trying to figure out what CA_EVENT_FRTO handler should do because
there exists a remote possibility of incompatibility between
FRTO and module defined ssthresh functions.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:14 -07:00
Ilpo Järvinen
d1a54c6a0a [TCP] FRTO: Reverse RETRANS bit clearing logic
Previously RETRANS bits were cleared on the entry to FRTO. We
postpone that into tcp_enter_frto_loss, which is really the
place were the clearing should be done anyway. This allows
simplification of the logic from a clearing loop to the head skb
clearing only.

Besides, the other changes made in the previous patches to
tcp_use_frto made it impossible for the non-SACKed FRTO to be
entered if other than the head has been rexmitted.

With SACK-enhanced FRTO (and Appendix B), however, there can be
a number retransmissions in flight when RTO expires (same thing
could happen before this patchset also with non-SACK FRTO). To
not introduce any jumpiness into the packet counting during FRTO,
instead of clearing RETRANS bits from skbs during entry, do it
later on.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:13 -07:00
Ilpo Järvinen
46d0de4ed9 [TCP] FRTO: Entry is allowed only during (New)Reno like recovery
This interpretation comes from RFC4138:
    "If the sender implements some loss recovery algorithm other
     than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD
     NOT be entered when earlier fast recovery is underway."

I think the RFC means to say (especially in the light of
Appendix B) that ...recovery is underway (not just fast recovery)
or was underway when it was interrupted by an earlier (F-)RTO
that hasn't yet been resolved (snd_una has not advanced enough).
Thus, my interpretation is that whenever TCP has ever
retransmitted other than head, basic version cannot be used
because then the order assumptions which are used as FRTO basis
do not hold.

NewReno has only the head segment retransmitted at a time.
Therefore, walk up to the segment that has not been SACKed, if
that segment is not retransmitted nor anything before it, we know
for sure, that nothing after the non-SACKed segment should be
either. This assumption is valid because TCPCB_EVER_RETRANS does
not leave holes but each non-SACKed segment is rexmitted
in-order.

Check for retrans_out > 1 avoids more expensive walk through the
skb list, as we can know the result beforehand: F-RTO will not be
allowed.

SACKed skb can turn into non-SACked only in the extremely rare
case of SACK reneging, in this case we might fail to detect
retransmissions if there were them for any other than head. To
get rid of that feature, whole rexmit queue would have to be
walked (always) or FRTO should be prevented when SACK reneging
happens. Of course RTO should still trigger after reneging which
makes this issue even less likely to show up. And as long as the
response is as conservative as it's now, nothing bad happens even
then.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:12 -07:00
Ilpo Järvinen
7c9a4a5b67 [TCP]: Prevent unrelated cwnd adjustment while using FRTO
FRTO controls cwnd when it still processes the ACK input or it
has just reverted back to conventional RTO recovery; the normal
rules apply when FRTO has reverted to standard congestion
control.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:11 -07:00
Ilpo Järvinen
94d0ea7786 [TCP] FRTO: frto_counter modulo-op converted to two assignments
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:10 -07:00
Ilpo Järvinen
52c63f1e86 [TCP]: Don't enter to fast recovery while using FRTO
Because TCP is not in Loss state during FRTO recovery, fast
recovery could be triggered by accident. Non-SACK FRTO is more
robust than not yet included SACK-enhanced version (that can
receiver high number of duplicate ACKs with SACK blocks during
FRTO), at least with unidirectional transfers, but under
extraordinary patterns fast recovery can be incorrectly
triggered, e.g., Data loss+ACK losses => cumulative ACK with
enough SACK blocks to meet sacked_out >= dupthresh condition).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:09 -07:00
Ilpo Järvinen
aa8b6a7ad1 [TCP] FRTO: Response should reset also snd_cwnd_cnt
Since purpose is to reduce CWND, we prevent immediate growth. This
is not a major issue nor is "the correct way" specified anywhere.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:08 -07:00
Ilpo Järvinen
95c4922bf9 [TCP] FRTO: fixes fallback to conventional recovery
The FRTO detection did not care how ACK pattern affects to cwnd
calculation of the conventional recovery. This caused incorrect
setting of cwnd when the fallback becames necessary. The
knowledge tcp_process_frto() has about the incoming ACK is now
passed on to tcp_enter_frto_loss() in allowed_segments parameter
that gives the number of segments that must be added to
packets-in-flight while calculating the new cwnd.

Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK
detection because RFC4138 states (in Section 2.2):
  If the first acknowledgment after the RTO retransmission
  does not acknowledge all of the data that was retransmitted
  in step 1, the TCP sender reverts to the conventional RTO
  recovery.  Otherwise, a malicious receiver acknowledging
  partial segments could cause the sender to declare the
  timeout spurious in a case where data was lost.

If the next ACK after RTO is duplicate, we do not retransmit
anything, which is equal to what conservative conventional
recovery does in such case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:07 -07:00
Ilpo Järvinen
6408d206c7 [TCP] FRTO: Ignore some uninteresting ACKs
Handles RFC4138 shortcoming (in step 2); it should also have case
c) which ignores ACKs that are not duplicates nor advance window
(opposite dir data, winupdate).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:06 -07:00
Ilpo Järvinen
7b0eb22b1d [TCP] FRTO: Use Disorder state during operation instead of Open
Retransmission counter assumptions are to be changed. Forcing
reason to do this exist: Using sysctl in check would be racy
as soon as FRTO starts to ignore some ACKs (doing that in the
following patches). Userspace may disable it at any moment
giving nice oops if timing is right. frto_counter would be
inaccessible from userspace, but with SACK enhanced FRTO
retrans_out can include other than head, and possibly leaving
it non-zero after spurious RTO, boom again.

Luckily, solution seems rather simple: never go directly to Open
state but use Disorder instead. This does not really change much,
since TCP could anyway change its state to Disorder during FRTO
using path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when
a SACK block makes ACK dubious). Besides, Disorder seems to be
the state where TCP should be if not recovering (in Recovery or
Loss state) while having some retransmissions in-flight (see
tcp_try_to_open), which is exactly what happens with FRTO.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:05 -07:00
Ilpo Järvinen
7487c48c4f [TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh
In case a latency spike causes more than one RTO, the later should not
cause the already reduced ssthresh to propagate into the prior_ssthresh
since FRTO declares all such RTOs spurious at once or none of them. In
treating of ssthresh, we mimic what tcp_enter_loss() does.

The previous state (in frto_counter) must be available until we have
checked it in tcp_enter_frto(), and also ACK information flag in
process_frto().

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:04 -07:00
Ilpo Järvinen
30935cf4f9 [TCP] FRTO: Comment cleanup & improvement
Moved comments out from the body of process_frto() to the head
(preferred way; see Documentation/CodingStyle). Bonus: it's much
easier to read in this compacted form.

FRTO algorithm and implementation is described in greater detail.
For interested reader, more information is available in RFC4138.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:03 -07:00
Ilpo Järvinen
bdaae17da8 [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c
In addition, removed inline.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:02 -07:00
Ilpo Järvinen
9ead9a1d38 [TCP] FRTO: Separated response from FRTO detection algorithm
FRTO spurious RTO detection algorithm (RFC4138) does not include response
to a detected spurious RTO but can use different response algorithms.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:01 -07:00
Ilpo Järvinen
522e7548a9 [TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit
FRTO was slightly too brave... Should only clear
TCPCB_SACKED_RETRANS bit.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:00 -07:00
YOSHIFUJI Hideaki
e905a9edab [NET] IPV4: Fix whitespace errors.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10 23:19:39 -08:00
Baruch Even
8a3c3a9727 [TCP]: Check num sacks in SACK fast path
We clear the unused parts of the SACK cache, This prevents us from mistakenly
taking the cache data if the old data in the SACK cache is the same as the data
in the SACK block. This assumes that we never receive an empty SACK block with
start and end both at zero.

Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:50 -08:00
Baruch Even
6f74651ae6 [TCP]: Seperate DSACK from SACK fast path
Move DSACK code outside the SACK fast-path checking code. If the DSACK
determined that the information was too old we stayed with a partial cache
copied. Most likely this matters very little since the next packet will not be
DSACK and we will find it in the cache. but it's still not good form and there
is little reason to couple the two checks.

Since the SACK receive cache doesn't need the data to be in host order we also
remove the ntohl in the checking loop.

Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:49 -08:00
Baruch Even
fda03fbb56 [TCP]: Advance fast path pointer for first block only
Only advance the SACK fast-path pointer for the first block, the
fast-path assumes that only the first block advances next time so we
should not move the cached skb for the next sack blocks.

Signed-off-by: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08 12:38:48 -08:00
Baruch Even
db3ccdac26 [TCP]: Fix sorting of SACK blocks.
The sorting of SACK blocks actually munges them rather than sort,
causing the TCP stack to ignore some SACK information and breaking the
assumption of ordered SACK blocks after sorting.

The sort takes the data from a second buffer which isn't moved causing
subsequent data moves to occur from the wrong location. The fix is to
use a temporary buffer as a normal sort does.

Signed-off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-01-25 13:35:06 -08:00
Masayuki Nakagawa
fb7e2399ec [TCP]: skb is unexpectedly freed.
I encountered a kernel panic with my test program, which is a very
simple IPv6 client-server program.

The server side sets IPV6_RECVPKTINFO on a listening socket, and the
client side just sends a message to the server.  Then the kernel panic
occurs on the server.  (If you need the test program, please let me
know. I can provide it.)

This problem happens because a skb is forcibly freed in
tcp_rcv_state_process().

When a socket in listening state(TCP_LISTEN) receives a syn packet,
then tcp_v6_conn_request() will be called from
tcp_rcv_state_process().  If the tcp_v6_conn_request() successfully
returns, the skb would be discarded by __kfree_skb().

However, in case of a listening socket which was already set
IPV6_RECVPKTINFO, an address of the skb will be stored in
treq->pktopts and a ref count of the skb will be incremented in
tcp_v6_conn_request().  But, even if the skb is still in use, the skb
will be freed.  Then someone still using the freed skb will cause the
kernel panic.

I suggest to use kfree_skb() instead of __kfree_skb().

Signed-off-by: Masayuki Nakagawa <nakagawa.msy@ncos.nec.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-01-23 20:25:52 -08:00
Ralf Baechle
e16aa207cc [NET]: Memory barrier cleanups
I believe all the below memory barriers only matter on SMP so
therefore the smp_* variant of the barrier should be used.

I'm wondering if the barrier in net/ipv4/inet_timewait_sock.c should be
dropped entirely.  schedule_work's implementation currently implies a
memory barrier and I think sane semantics of schedule_work() should imply
a memory barrier, as needed so the caller shouldn't have to worry.
It's not quite obvious why the barrier in net/packet/af_packet.c is
needed; maybe it should be implied through flush_dcache_page?

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-07 00:11:33 -08:00
Al Viro
b51655b958 [NET]: Annotate __skb_checksum_complete() and friends.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:23:38 -08:00
YOSHIFUJI Hideaki
cfb6eeb4c8 [TCP]: MD5 Signature Option (RFC2385) support.
Based on implementation by Rick Payne.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:39 -08:00
Venkat Yekkirala
6b877699c6 SELinux: Return correct context for SO_PEERSEC
Fix SO_PEERSEC for tcp sockets to return the security context of
the peer (as represented by the SA from the peer) as opposed to the
SA used by the local/source socket.

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: James Morris <jmorris@namei.org>
2006-12-02 21:21:33 -08:00
David S. Miller
80246ab36e [TCP]: Kill warning in tcp_clean_rtx_queue().
GCC can't tell we always initialize 'tv' in all the cases
we actually use it, so explicitly set it up with zeros.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-10-04 00:31:08 -07:00
John Heffner
8ea333eb5d [TCP]: Fix and simplify microsecond rtt sampling
This changes the microsecond RTT sampling so that samples are taken in
the same way that RTT samples are taken for the RTO calculator: on the
last segment acknowledged, and only when the segment hasn't been
retransmitted.

Signed-off-by: John Heffner <jheffner@psc.edu>
Acked-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 18:03:08 -07:00
Al Viro
4f3608b787 [TCP] net/ipv4/tcp_input.c: trivial annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 18:02:11 -07:00
Al Viro
269bd27e66 [TCP]: struct tcp_sack_block annotations
Some of the instances of tcp_sack_block are host-endian, some - net-endian.
Define struct tcp_sack_block_wire identical to struct tcp_sack_block
with u32 replaced with __be32; annotate uses of tcp_sack_block replacing
net-endian ones with tcp_sack_block_wire.  Change is obviously safe since
for cc(1) __be32 is typedefed to u32.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-28 18:02:04 -07:00
Alexey Kuznetsov
1ef9696c90 [TCP]: Send ACKs each 2nd received segment.
It does not affect either mss-sized connections (obviously) or
connections controlled by Nagle (because there is only one small
segment in flight).

The idea is to record the fact that a small segment arrives on a
connection, where one small segment has already been received and
still not-ACKed. In this case ACK is forced after tcp_recvmsg() drains
receive buffer.

In other words, it is a "soft" each-2nd-segment ACK, which is enough
to preserve ACK clock even when ABC is enabled.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:19:05 -07:00
Herbert Xu
ff9b5e0f08 [TCP]: Fix rcv mss estimate for LRO
By passing a Linux-generated TSO packet straight back into Linux, Xen
becomes our first LRO user :) Unfortunately, there is at least one spot
in our stack that needs to be changed to cope with this.

The receive MSS estimate is computed from the raw packet size.  This is
broken if the packet is GSO/LRO.  Fortunately the real MSS can be found
in gso_size so we simply need to use that if it is non-zero.

Real LRO NICs should of course set the gso_size field in future.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:18:44 -07:00
Brian Haley
ab32ea5d8a [NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly
Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:55:03 -07:00
Stephen Hemminger
b3a8a40da5 [TCP]: Turn ABC off.
Turn Appropriate Byte Count off by default because it unfairly
penalizes applications that do small writes.  Add better documentation
to describe what it is so users will understand why they might want to
turn it on.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-17 23:21:02 -07:00
Daikichi Osuga
3fdf3f0c99 [TCP]: Two RFC3465 Appropriate Byte Count fixes.
1) fix slow start after retransmit timeout
2) fix case of L=2*SMSS acked bytes comparison

Signed-off-by: Daikichi Osuga <osugad@s1.nttdocomo.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-29 21:22:16 -07:00
Ilpo Järvinen
d254bcdbf2 [TCP]: Fixes IW > 2 cases when TCP is application limited
Whenever a transfer is application limited, we are allowed at least
initial window worth of data per window unless cwnd is previously
less than that.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-04 22:59:52 -07:00