Commit Graph

605 Commits

Author SHA1 Message Date
Fabian Frederick
7e5fec3168 drbd: code cleanups without semantic changes
This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg
f2d3d75b66 drbd: sync_handshake: handle identical uuids with current (frozen) Primary
If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg
9104d31a75 drbd: introduce WRITE_SAME support
We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME
 * logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg
60bac04012 drbd: report sizes if rejecting too small peer disk
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg
dd4f699da6 drbd: when receiving P_TRIM, zero-out partial unaligned chunks
We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg
f9ff0da564 drbd: allow parallel flushes for multi-volume resources
To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Philipp Reisner
92d94ae66a drbd: Create the protocol feature THIN_RESYNC
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses block_dev_issue_zeroout() on the range in
question.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Philipp Reisner
700ca8c04a drbd: Implement handling of thinly provisioned storage on resync target nodes
If during resync we read only zeroes for a range of sectors assume
that these secotors can be discarded on the sync target node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Lars Ellenberg
be115b69f1 drbd: change bitmap write-out when leaving resync states
When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:03 -06:00
Mike Christie
28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
bb3cc85e16 drbd: use bio op accessors
Separate the op from the rq_flag_bits and have drbd
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Herbert Xu
9534d67195 drbd: Use shash and ahash
This patch replaces uses of the long obsolete hash interface with
either shash (for non-SG users) or ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:36:08 +08:00
Lars Ellenberg
39e91a60c8 drbd: use resource name in workqueue
Since kernel 3.3, we can use snprintf-style arguments
to create a workqueue.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:02 -07:00
Philipp Reisner
668700b40a drbd: Create a dedicated workqueue for sending acks on the control connection
The intention is to reduce CPU utilization. Recent measurements
unveiled that the current performance bottleneck is CPU utilization
on the receiving node. The asender thread became CPU limited.

One of the main points is to eliminate the idr_for_each_entry() loop
from the sending acks code path.

One exception in that is sending back ping_acks. These stay
in the ack-receiver thread. Otherwise the logic becomes too
complicated for no added value.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:01 -07:00
Philipp Reisner
1c03e52083 drbd: Rename asender to ack_receiver
This prepares the next patch where the sending on the meta (or
control) socket is moved to a dedicated workqueue.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:01 -07:00
Andreas Gruenbacher
a29728463b drbd: Backport the "events2" command
The events2 command originates from drbd-9 development. It features
more information but requires a incompatible change in output
format.
Therefore the previous events command continues to exist, the new
improved events2 command becomes available now.

This prepares the user-base for a later switch to the complete
drbd9 code base.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:00 -07:00
Andreas Gruenbacher
f6ba863639 drbd: Move enum write_ordering_e to drbd.h
Also change the enum values to all-capital letters.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:00 -07:00
Andreas Gruenbacher
5dd2ca1912 drbd: Get rid of some first_peer_device() calls
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25 09:22:00 -07:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Eric W. Biederman
eeb1bd5c40 net: Add a struct net parameter to sock_create_kern
This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:17 -04:00
Martin K. Petersen
d93ba7a5a9 block: Add discard flag to blkdev_issue_zeroout() function
blkdev_issue_discard() will zero a given block range. This is done by
way of explicit writing, thus provisioning or allocating the blocks on
disk.

There are use cases where the desired behavior is to zero the blocks but
unprovision them if possible. The blocks must deterministically contain
zeroes when they are subsequently read back.

This patch adds a flag to blkdev_issue_zeroout() that provides this
variant. If the discard flag is set and a block device guarantees
discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
the device does not support discard_zeroes_data or if the discard
request fails we will fall back to first REQ_WRITE_SAME and then a
regular REQ_WRITE.

Also update the callers of blkdev_issue_zero() to reflect the new flag
and make sb_issue_zeroout() prefer the discard approach.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-21 10:41:46 -07:00
Lars Ellenberg
ff8bd88b73 drbd: fix resync throttling initialization
If for some reason DRBD resync was the only activity on a backend
device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is
still initialization time, and keep throttling the resync.

This patch explicitly initializes ->rs_last_events to the current
backend event counters, and drops the rs_last_events == 0 from the
throttle condition.

Reported-by: Mikhail Sugakov <msugakov@amazon.de>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:37 -07:00
Lars Ellenberg
abde9cc6a5 drbd: Improve asender performance
Shorten receive path in the asender thread. Reduces CPU utilisation
of asender when receiving packets, and with that increases IOPs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-11 08:41:29 -06:00
Andreas Gruenbacher
11f8b2b69d drbd: Use better variable names
Rename local variable 'ds' to 'disk_state' or 'data_size'.
'dgs' to 'digest_size'

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-11 08:41:29 -06:00
Dan Carpenter
bf0d6e4a11 drbd: silence underflow warning in read_in_block()
My static checker warns that "data_size" could be negative and underflow
the limit check.  The code looks suspicious but I don't know if it is a
real bug.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:23 +02:00
Lars Ellenberg
944410e97c drbd: debugfs: add callback_history
Add a per-connection worker thread callback_history
with timing details, call site and callback function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:18 +02:00
Lars Ellenberg
f418815f7a drbd: debugfs: Add in_flight_summary
* Add details about pending meta data operations to in_flight_summary.

* Report number of requests waiting for activity log transactions.

* timing details of peer_requests to in_flight_summary.

* FLUSH details
  DRBD devides the incoming request stream into "epochs",
  in which peers are allowed to re-order writes independendly.

  These epochs are separated by P_BARRIER on the replication link.
  Such barrier packets, depending on configuration, may cause
  the receiving side to drain the lower level device request queues
  and call blkdev_issue_flush().

  This is known to be an other major source of latency in DRBD.

  Track timing details of calls to blkdev_issue_flush(),
  and add them to in_flight_summary.

* data socket stats
  To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
  it is useful to see network buffer stats along with the timing details of
  requests, peer requests, and meta data IO.

* pending bitmap IO timing details to in_flight_summary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:17 +02:00
Lars Ellenberg
c5a2c1509e drbd: register peer requests on read_ee early
Initialize peer_request with timestamp and proper empty list head.
Add peer_request to list early, so debugfs can find this request and
report it as "preparing", even if we sleep before we actually submit it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:14 +02:00
Lars Ellenberg
21ae5d7f95 drbd: track timing details of peer_requests
To be able to present timing details in debugfs,
we need to track preparation/submit times of peer requests.

Track peer request flags early,
before they are put on the epoch_entry lists.

Waiting for activity log transactions may be a major latency factor.
We want to be able to present the peer_request state accurately in
debugfs, and what it is waiting for.

Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
Set it only *after* calling drbd_al_begin_io(),
clear it as soon as we call drbd_al_complete_io().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:14 +02:00
Lars Ellenberg
ad3fee7900 drbd: improve throttling decisions of background resynchronisation
Background resynchronisation does some "side-stepping", or throttles
itself, if it detects application IO activity, and the current resync
rate estimate is above the configured "cmin-rate".

What was not detected: if there is no application IO,
because it blocks on activity log transactions.

Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
and count non-zero as application IO activity.
This counter is exposed at proc_details level 2 and above.

Also make sure to release the currently locked resync extent
if we side-step due to such voluntary throttling.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:13 +02:00
Lars Ellenberg
45d2933c92 drbd: also keep track of trim -> zero-out fallback peer_requests
To be able to find and present such zero-out fallback peer_requests
in debugfs, we add those to "active_ee", once that list drained.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:09 +02:00
Lars Ellenberg
b9ed7080d7 drbd: consistently use list_add_tail for peer_request tracking
Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...)
consistently "oldest first".  That way finding the oldest not yet
successfully processed request is simply list_first_entry_or_null.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:08 +02:00
Lars Ellenberg
08d0dabf48 drbd: application writes may set-in-sync in protocol != C
If "dirty" blocks are written to during resync,
that brings them in-sync.

By explicitly requesting write-acks during resync even in protocol != C,
we now can actually respect this.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:02 +02:00
Philipp Reisner
5d0b17f1a2 drbd: New net configuration option socket-check-timeout
In setups involving a DRBD-proxy and connections that experience a lot of
buffer-bloat it might be necessary to set ping-timeout to an
unusual high value. By default DRBD uses the same value to wait if a newly
established TCP-connection is stable. Since the DRBD-proxy is usually located
in the same data center such a long wait time may hinder DRBD's connect process.

In such setups socket-check-timeout should be set to
at least to the round trip time between DRBD and DRBD-proxy. I.e. in most
cases to 1.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:01 +02:00
Philipp Reisner
4920e37a9e drbd: Limit the time we are waiting for the first packet on an accepted socket
Before the patch
'drbd: Keep the listening socket open while trying to connect to the peer'

the newly created socket inherited the receive timeout from the listen
socket. The listen socket had a receive timeout of connect-intervall
+- 30% random jitter.

The real issue is that after the mentioned patch we had no timeout at all.
Now use 4 times the ping-timeout.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:00 +02:00
Lars Ellenberg
aaaba34576 drbd: implement csums-after-crash-only
Checksum based resync trades CPU cycles for network bandwidth,
in situations where we expect much of the to-be-resynced blocks
to be actually identical on both sides already.

In a "network hickup" scenario, it won't help:
all to-be-resynced blocks will typically be different.

The use case is for the resync of *potentially* different blocks
after crash recovery -- the crash recovery had marked larger areas
(those covered by the activity log) as need-to-be-resynced,
just in case. Most of those blocks will be identical.

This option makes it possible to configure checksum based resync,
but only actually use it for the first resync after primary crash.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:35:00 +02:00
Lars Ellenberg
6a8d68b187 drbd: don't implicitly resize Diskless node beyond end of device
During handshake, we compare backend sizes, and user set limits,
and agree on what device size we are going to expose.

We remember that last-agreed-size in our meta data.

But if we come up diskless, we have to accept what the peer
presents us with. We used to accept the peers maximum potential
capacity (backend size), which is wrong, and could lead to IO errors
due to access beyond end of device.

Instead, we need to accept the peer's current size.
Unless that is communicated as 0, in which case we
accept the backend size, or the user set limit, if set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:59 +02:00
Lars Ellenberg
8ce953aa39 drbd: silence -Wmissing-prototypes warnings
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:57 +02:00
Lars Ellenberg
4dd726f029 drbd: get rid of drbd_queue_work_front
The last user was al_write_transaction, if called with "delegate",
and the last user to call it with "delegate = true" was the receiver
thread, which has no need to delegate, but can call it himself.

Finally drop the delegate parameter, drop the extra
w_al_write_transaction callback, and drop drbd_queue_work_front.

Do not (yet) change dequeue_work_item to dequeue_work_batch, though.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 18:34:56 +02:00
Lars Ellenberg
70df70927b drbd: allow write-ordering policy to be bumped up again
Previously, once you disabled flushes as a means of enforcing
write-ordering, you'd need to detach/re-attach to enable them again.

Allow drbdsetup disk-options to re-enable previously disabled
write-ordering policy options at runtime.

While at it fix RCU in drbd_bump_write_ordering()
max_allowed_wo() uses rcu_dereference, therefore it must
be called within rcu_read_lock()/rcu_read_unlock()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:22 +02:00
Lars Ellenberg
44a4d55184 drbd: refactor use of first_peer_device()
Reduce the number of calls to first_peer_device(). Instead, call
first_peer_device() just once to assign a local variable peer_device.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:22 +02:00
Philipp Reisner
8fe39aac05 drbd: device->ldev is not guaranteed on an D_ATTACHING disk
Some parts of the code assumed that get_ldev_if_state(device, D_ATTACHING)
is sufficient to access the ldev member of the device object. That was
wrong. ldev may not be there or might be freed at any time if the device
has a disk state of D_ATTACHING.

bm_rw()
  Documented that drbd_bm_read() is only called from drbd_adm_attach.
  drbd_bm_write() is only called when a reference is held, and it is
  documented that a caller has to hold a reference before calling
  drbd_bm_write()

drbd_bm_write_page()
  Use get_ldev() instead of get_ldev_if_state(device, D_ATTACHING)

drbd_bmio_set_n_write()
  No longer use get_ldev_if_state(device, D_ATTACHING). All callers
  hold a reference to ldev now.

drbd_bmio_clear_n_write()
  All callers where holding a reference of ldev anyways. Remove the
  misleading get_ldev_if_state(device, D_ATTACHING)

drbd_reconsider_max_bio_size()
  Removed the get_ldev_if_state(device, D_ATTACHING). All callers
  now pass a struct drbd_backing_dev* when they have a proper
  reference, or a NULL pointer.
  Before this fix, the receiver could trigger a NULL pointer
  deref when in drbd_reconsider_max_bio_size()

drbd_bump_write_ordering()
  Used get_ldev_if_state(device, D_ATTACHING) with the wrong assumption.
  Remove it, and allow the caller to pass in a struct drbd_backing_dev*
  when the caller knows that accessing this bdev is safe.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:20 +02:00
Philipp Reisner
e952658020 drbd: Move write_ordering from connection to resource
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2014-07-10 15:22:19 +02:00
Lars Ellenberg
54ed4ed8f9 drbd: fix NULL pointer deref in blk_add_request_payload
Discards don't have any payload.
But the scsi layer still expects a bio_vec it can use internally,
see sd_setup_discard_cmnd() and blk_add_request_payload().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-25 09:53:47 -06:00
Philipp Reisner
67cca286ca drbd: Fix a hole in the challange-response connection authentication
In the implementation as it was, the two peers sent each other
a challenge, and expects the challenge hashed with the shared
secret back.

A attacker could simply wait for the challenge of the peer, and
send the same challenge back. Then it waits for the response, and
sends the same response back.

Prevent this by not accepting a challenge from the peer that is
the same as the challenge sent to the peer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:55 -06:00
Lars Ellenberg
20c68fdea1 drbd: Enable QUEUE_FLAG_DISCARD only if the peer can recieve P_TRIM
Allow the user of REQ_DISCARD.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:55 -06:00
Lars Ellenberg
a0fb3c47a1 drbd: prepare receiving side for REQ_DISCARD
If the receiver needs to serve a discard request on a queue that does
not announce to be discard cabable, it falls back to do synchronous
blkdev_issue_zeroout().

We expect only "reasonably" large (up to one activity log extent?)
discard requests.

We do this to not to not block the receiver for too long in this
fallback code path, and to not set/clear too many bits inside one
spinlock_irq_save() in drbd_set_in_sync/drbd_set_out_of_sync,

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:55 -06:00
Lars Ellenberg
e829987433 drbd: don't let application IO pre-empt resync too often
Before, application IO could pre-empt resync activity
for up to hardcoded 20 seconds per resync request.
A very busy server could throttle the effective resync bandwidth
down to one request per 20 seconds.

Now, we only let application IO pre-empt resync traffic
while the current resync rate estimate is above c-min-rate.

If you disable the c-min-rate throttle feature (set c-min-rate = 0),
application IO will no longer pre-empt resync traffic at all.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:54 -06:00
Lars Ellenberg
0e49d7b014 drbd: fix potential distributed deadlock during verify or resync
If max-buffers and socket buffer sizes are "too small" for the chosen
resync rate, this could lead potentially lead to a distributed deadlock,
which may or may not resolve itself via the "ko-count" and request
timeout mechanism, or could be resolved by forced disconnect.

One option to deal with this is proper configuration:
use larger max-buffer and socket buffers settings,
or reduce the resync rate.

But even with bad configuration we should not deadlock,
but "gracefully" recover.

The issue is avoided by using only up to max-buffers/2 for resync
requests, and by using max-buffers not as a hard limit for data buffer
allocations, but as a throttle threshold only.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:54 -06:00
Philipp Reisner
31007745a5 drbd: Break a deadlock while concurrent fencing and establishing a connection
When we need to outdate the peer while being promoted to primary,
and the connection gets established at the same time, we deadlock
in drbd_try_outdate_peer() when trying to clear the susp_fen
bit.

Fix this by setting the STATE_SENT bit while holding the mutex.

Using drbd_change_state(.. , CS_HARD, ..) which does not block
until STATE_SENT is cleared, is only for clearness. It does
not contribute anything to the fix.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 13:46:53 -06:00
Linus Torvalds
5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Al Viro
f730c848af drbd: don't open-code kernel_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:32 -04:00
Andreas Gruenbacher
81f0ffd2a2 drbd: Remove unused parameter of wire_flags_to_bio()
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:46 +01:00
Andreas Gruenbacher
e33b32de04 drbd: Get rid of first_peer_device() in handle_write_conflicts()
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:45 +01:00
Andreas Gruenbacher
a8cd15ba79 drbd: struct drbd_peer_request: Use drbd_work instead of drbd_device_work
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:41 +01:00
Andreas Gruenbacher
b5043c5e2c drbd: Turn conn_flush_workqueue() into drbd_flush_workqueue()
The new function can flush any work queue, not just the work queue of the data
socket of a connection.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:40 +01:00
Andreas Gruenbacher
84b8c06b65 drbd: Create a dedicated struct drbd_device_work
drbd_device_work is a work item that has a reference to a device,
while drbd_work is a more generic work item that does not carry
a reference to a device.

All callbacks get a pointer to a drbd_work instance, those callbacks
that expect a drbd_device_work use the container_of macro to get it.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:39 +01:00
Andreas Gruenbacher
8682eae9b4 drbd: Rename w_prev_work_done -> w_complete
Also move it to drbd_receiver.c and make it static.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:38 +01:00
Andreas Gruenbacher
69a227731a drbd: Pass a peer device to a number of fuctions
These functions actually operate on a peer device, or
need a peer device.

drbd_prepare_command(), drbd_send_command(), drbd_send_sync_param()
drbd_send_uuids(), drbd_gen_and_send_sync_uuid(), drbd_send_sizes()
drbd_send_state(), drbd_send_current_state(), and drbd_send_state_req()
drbd_send_sr_reply(), drbd_send_ack(), drbd_send_drequest(),
drbd_send_drequest_csum(), drbd_send_ov_request(), drbd_send_dblock()
drbd_send_block(), drbd_send_out_of_sync(), recv_dless_read()
drbd_drain_block(), receive_bitmap_plain(), recv_resync_read()
read_in_block(), read_for_csum(), drbd_alloc_pages(), drbd_alloc_peer_req()
need_peer_seq(), update_peer_seq(), wait_for_and_update_peer_seq()
drbd_sync_handshake(), drbd_asb_recover_{0,1,2}p(), drbd_connected()
drbd_disconnected(), decode_bitmap_c() and recv_bm_rle_bits()

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:50:31 +01:00
Andreas Gruenbacher
9f4fe9ad20 drbd: Replace vnr_to_mdev() with conn_peer_device()
The new function returns a peer device, which allows us to eliminate a few
instances of first_peer_device().

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:49:37 +01:00
Andreas Gruenbacher
79a3c8d38c drbd: drbd_csum_bio(), drbd_csum_ee(): Remove unused device argument
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:49:36 +01:00
Andreas Gruenbacher
8fe605513a drbd: Rename drbdd_init() -> drbd_receiver()
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:47:58 +01:00
Andreas Gruenbacher
0500813fe0 drbd: Move conf_mutex from connection to resource
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:46:46 +01:00
Andreas Gruenbacher
0b0ba1efc7 drbd: Add explicit device parameter to D_ASSERT
The implicit dependency on a variable inside the macro is problematic.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:04 +01:00
Andreas Gruenbacher
1ec861ebd0 drbd: Replace and remove the obsolete conn_() macros
With the polymorphic drbd_() macros, we no longer need the connection
specific variants.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:03 +01:00
Andreas Gruenbacher
d018017102 drbd: Remove the terrible DEV hack
DRBD was using dev_err() and similar all over the code; instead of having to
write dev_err(disk_to_dev(device->vdisk), ...) to convert a drbd_device into a
kernel device, a DEV macro was used which implicitly references the device
variable.  This is terrible; introduce separate drbd_err() and similar macros
with an explicit device parameter instead.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:01 +01:00
Andreas Gruenbacher
c06ece6ba6 drbd: Turn connection->volumes into connection->peer_devices
Let connection->peer_devices point to peer devices; connection->volumes was
pointing to devices.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:45:00 +01:00
Andreas Gruenbacher
05a10ec790 drbd: Improve some function and variable naming
Rename functions
conn_destroy() -> drbd_destroy_connection(),
drbd_minor_destroy() -> drbd_destroy_device()
drbd_adm_add_minor() -> drbd_adm_add_minor()
drbd_adm_delete_minor() -> drbd_adm_del_minor()

Rename global variable minors to drbd_devices

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:52 +01:00
Andreas Gruenbacher
a6b32bc3ce drbd: Introduce "peer_device" object between "device" and "connection"
In a setup where a device (aka volume) can replicate to multiple peers and one
connection can be shared between multiple devices, we need separate objects to
represent devices on peer nodes and network connections.

As a first step to introduce multiple connections per device, give each
drbd_device object a single drbd_peer_device object which connects it to a
drbd_connection object.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:51 +01:00
Andreas Gruenbacher
bde89a9e15 drbd: Rename drbd_tconn -> drbd_connection
sed -i -e 's:all_tconn:connections:g' -e 's:tconn:connection:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:44:47 +01:00
Andreas Gruenbacher
b30ab7913b drbd: Rename "mdev" to "device"
sed -i -e 's:mdev:device:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:42:24 +01:00
Andreas Gruenbacher
5476169793 drbd: Rename struct drbd_conf -> struct drbd_device
sed -i -e 's:\<drbd_conf\>:drbd_device:g'

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:36:44 +01:00
Andreas Gruenbacher
a3603a6e3b drbd: Split off on-the-wire protocol definitions
Keep the protocol definitions separate from the kernel code; they are useful in
their own right.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:27:49 +01:00
Rashika Kheria
f63e631a34 drivers: block: Mark functions as static in drbd_receiver.c
Mark functions conn_wait_active_ee_empty() and
drbd_crypto_alloc_digest_safe() as static in drbd/drbd_receiver.c
because they are not used outside this file.

This eliminates the following warning in drbd/drbd_receiver.c:
drivers/block/drbd/drbd_receiver.c:1401:6: warning: no previous prototype for ‘conn_wait_active_ee_empty’ [-Wmissing-prototypes]
drivers/block/drbd/drbd_receiver.c:3259:21: warning: no previous prototype for ‘drbd_crypto_alloc_digest_safe’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2014-02-17 16:19:39 +01:00
Kent Overstreet
7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Lars Ellenberg
d2da5b0cb5 drbd: fix decoding of bitmap vli rle for device sizes > 64 TB
Symptoms: disconnect after bitmap exchange due to
bitmap overflow (e:49731075554) while decoding bm RLE packet

In the decoding step of the variable length integer run length encoding
there was potentially an uncatched bitshift by wordsize (variable >> 64).

The result of which is "undefined" :(
(only "sometimes" the result is the desired 0)

Fix: don't do any bit shift magic for shift == 64, just assign.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
b874d231e1 drbd: Fix an connection drop issue after enabling allow-two-primaries
Since drbd-8.4.0 it is possible to change the allow-two-primaries
network option while the connection is established.

The sequence code used to partially order packets from the
data socket with packets from the meta-data socket, still assued
that the allow-two-primaries option is constant while the
connection is established.

I.e.
On a node that has the RESOLVE_CONFLICTS bits set, after enabling
allow-two-primaries, when receiving the next data packet it timed out
while waiting for the necessary packets on the data socket to arrive
(wait_for_and_update_peer_seq() function).

Fixed that by always tracking the sequence number, but only waiting
for it if allow-two-primaries is set.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:28 -07:00
Philipp Reisner
d752b26960 drbd: Allow online change of al-stripes and al-stripe-size
Allow to change the AL layout with an resize operation. For that
the reisze command gets two new fields: al_stripes and al_stripe_size.

In order to make the operation crash save:
1) Lock out all IO and MD-IO
2) Write the super block with MDF_PRIMARY_IND clear
3) write the bitmap to the new location (all zeros, since
   we allow only while connected)
4) Initialize the new AL-area
5) Write the super block with the restored MDF_PRIMARY_IND.
6) Unfreeze all IO

Since the AL-layout has no influence on the protocol, this operation
needs to be beforemed on both sides of a resource (if intended).

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Philipp Reisner
e96c96333f drbd: Constants should be UPPERCASE
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Andreas Gruenbacher
26ea8f9239 drbd: Do not sleep inside rcu
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-06-28 16:04:36 +02:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Akinobu Mita
38b682b261 drbd: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:42 -07:00
Philipp Reisner
3990e04df0 drbd: use sched_setscheduler()
It was unnoticed for some time that assigning to current->policy is
no longer sufficient to set a real time priority for a kernel thread.

Reported-by: Charlie Suffin <Charlie.Suffin@stratus.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 10:10:25 -06:00
Philipp Reisner
7c689e63a8 drbd: fix for deadlock when using automatic split-brain-recovery
With an automatic after split-brain recovery policy of
"after-sb-1pri call-pri-lost-after-sb",
when trying to drbd_set_role() to R_SECONDARY,
we run into a deadlock.

This was first recognized and supposedly fixed by
2009-06-10 "Fixed a deadlock when using automatic split brain recovery when both nodes are"
replacing drbd_set_role() with drbd_change_state() in that code-path,
but the first hunk of that patch forgets to remove the drbd_set_role().

We apparently only ever tested the "two primaries" case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 10:10:25 -06:00
Philipp Reisner
ef57f9e6bb drbd: Fix build error when CONFIG_CRYPTO_HMAC is not set
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 10:10:25 -06:00
Philipp Reisner
2d56a974f3 drbd: reset ap_in_flight counter for new connections
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 10:10:24 -06:00
Lars Ellenberg
56392d2f40 drbd: Clarify when activity log I/O is delegated to the worker thread
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:14:00 -06:00
Lars Ellenberg
cccac9857d drbd: mechanically rename la_size to la_size_sect
Make it obvious that this value is in units of 512 Byte sectors.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Philipp Reisner
13c76aba78 drbd: close race between drbd_set_role and drbd_connect
drbd_set_role(, R_PRIMARY, ) does the state change to Primary,
some more housekeeping, and possibly generates a new UUID set.

All of this holding the "state_mutex".

The connection handshake involves sending of various state information,
including the current data generation UUID set, and two connection
state changes from C_WF_CONNECTION to C_WF_REPORT_PARAMS further to
a number of different outcomes, resync being one of them.

If the connection handshake happens between the state change to Primary
and the generation of the new UUIDs, the resync decision based on the
old UUID set may be confused, depending on circumstances.

Make sure that, before we do the handshake, any promotion to Primary
role will either be complete (including the housekeeping stuff), or can
see, and serialize with, the ongoing handshake, based on the
"STATE_SENT" bit, which is set when we start the handshake, and cleared
only when we leave C_WF_REPORT_PARAMS again.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:00:33 +01:00
Lars Ellenberg
ed635cb067 drbd: if the replication link breaks during handshake, keep retrying
The 8.3.12 commit drbd: Bugfix for the connection behavior fixes a
"wasted established connection", if a former connection attempt failed
during its early stages.

However it opened a window for a regression, if a connection attempt
fails during its last stages.  The result was a terminated receiver
thread, that left behind the supposedly transient "C_UNCONNECTED" state.
Any later requests to change the connection state fail, as they wait for
the connection state to "stabilize".

Fix: short circuit and keep retrying to restablish a new connection,
if we don't reach C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:19 +01:00
Jing Wang
063eacf88c drbd: check return of kmalloc in receive_uuids
Signed-off-by: Jing Wang <windsdaemon@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:10 +01:00
Philipp Reisner
986836503e Merge branch 'drbd-8.4_ed6' into for-3.8-drivers-drbd-8.4_ed6 2012-11-09 14:20:23 +01:00
Philipp Reisner
1393b59f8c drbd: Remove duplicate code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:38 +01:00
Philipp Reisner
19fffd7b03 drbd: Call drbd_md_sync() explicitly after a state change on the connection
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.

If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.

  The worker might be blocked in sending something, i.e. it
  takes until it gets into its timeout. That is 6 seconds by
  default which is longer than the 5 seconds of the md_sync_timer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:08 +01:00
Lars Ellenberg
08332d7325 drbd: properly call drbd_rs_cancel_all() in drbd_disconnected()
drbd_disconnected() is supposed to clear the resync lru cache,
by calling drbd_rs_cancel_all().

We must do so before we call drbd_flush_workqueue(), as at least the
callback w_restart_disk_io() may wait for resync progres, and would
otherwise deadlock.

drbd_finish_peer_reqs() may again populate that cache, which will
then potentially be stale after the next resync handshake and bitmap
exchange, we have to do it again after that.

A stale resync lru cache causes no harm but ugly messages like this:
 BAD! sector=196608s enr=6 rs_left=-256 rs_failed=0 count=256 cstate=SyncTarget

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner
155522df5b drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner
b66623e33e drbd: Avoid NetworkFailure state during disconnect
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.

In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.

At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:18 +01:00
Philipp Reisner
39a1aa7f49 drbd: Protect accesses to the uuid set with a spinlock
There is at least the worker context, the receiver context, the context of
receiving netlink packts.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:04 +01:00
Andreas Gruenbacher
715306f69d drbd: Don't unregister socket state_change callback from within the callback
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Lars Ellenberg
427c0434fc drbd: disambiguation, s/DISCARD_CONCURRENT/RESOLVE_CONFLICTS/
We don't discard anything here, really.
We resolve conflicting, concurrent writes to overlapping data blocks.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg
d4dabbe22d drbd: disambiguation, s/P_DISCARD_WRITE/P_SUPERSEDED/
To avoid confusion with REQ_DISCARD aka TRIM, rename our
"discard concurrent write acks" from P_DISCARD_WRITE to P_SUPERSEDED.

At the same time, rename the drbd request event DISCARD_WRITE
to CONFLICT_RESOLVED. It already triggers both successful completion
or restart of the request, depending on our RQ_POSTPONED flag.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg
e959d08d3e drbd: Fix a potential issue with the DISCARD_CONCURRENT flag
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Philipp Reisner
80c6eed49d drbd: More random to the connect logic
Since the listening socket is open all the time, it was possible to
get into stable "initial packet S crossed" loops.

* when both sides realize in the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with repeated "packet S crossed"
  messages.

* when both sides do not realize with the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with alternating "packet S crossed"
  "packet M crossed" messages.

In order to break out these stable loops randomize the behaviour if
such a crossing of P_INITIAL_DATA or P_INITIAL_META packets is detected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:46 +01:00
Philipp Reisner
92f14951c0 drbd: Try to connec to peer only once per cycle
Since now our listening socket is open all the time we will get
connection tries of the peer always in. No need to try it three
times.

This is valid when connecting to older peers as well, it simply
increases the probability that the new version DRBD will accept
a connection instead that it will establish one.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Philipp Reisner
b666dbf819 drbd: Remove redundant and wrong test for NULL simplification in conn_connect()
Since the drbd_socket_okay() function itself tests if the the
socket is NULL, the explicit test "if (sock.socket && &msock.socket)"
was redundent.
Apart from that the address opperator ('&') before msock.socket rendered
the test pointless.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Lars Ellenberg
81a3537a97 drbd: announce FLUSH/FUA capability to upper layers
In 8.4, we may have bios spanning two activity log extents.
Fixup drbd_al_begin_io() and drbd_al_complete_io() to deal with zero sized bios.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:44 +01:00
Lars Ellenberg
58ffa580a7 drbd: introduce stop-sector to online verify
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.

Had to bump the protocol version differently, we are now 101.
Added verify_can_do_stop_sector() { protocol >= 97 && protocol != 100; }

Also, the return value convention for worker callbacks has changed,
we returned "true/false" for "keep the connection up" in 8.3,
we return 0 for success and <= for failure in 8.4.
Affected: receive_state()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:32 +01:00
Lars Ellenberg
b6dd1a8976 drbd: remove struct drbd_tl_epoch objects (barrier works)
cherry-picked and adapted from drbd 9 devel branch

DRBD requests (struct drbd_request) are already on the per resource
transfer log list, and carry their epoch number. We do not need to
additionally link them on other ring lists in other structs.

The drbd sender thread can recognize itself when to send a P_BARRIER,
by tracking the currently processed epoch, and how many writes
have been processed for that epoch.

If the epoch of the request to be processed does not match the currently
processed epoch, any writes have been processed in it, a P_BARRIER for
this last processed epoch is send out first.
The new epoch then becomes the currently processed epoch.

To not get stuck in drbd_al_begin_io() waiting for P_BARRIER_ACK,
the sender thread also needs to handle the case when the current
epoch was closed already, but no new requests are queued yet,
and send out P_BARRIER as soon as possible.

This is done by comparing the per resource "current transfer log epoch"
(tconn->current_tle_nr) with the per connection "currently processed
epoch number" (tconn->send.current_epoch_nr), while waiting for
new requests to be processed in wait_for_work().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg
d5b27b01f1 drbd: move the drbd_work_queue from drbd_socket to drbd_connection
cherry-picked and adapted from drbd 9 devel branch
In 8.4, we don't distinguish between "resource work" and "connection
work" yet, we have one worker for both, as we still have only one connection.

We only ever used the "data.work",
no need to keep the "meta.work" around.

Move tconn->data.work to tconn->sender_work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:34 +01:00
Philipp Reisner
7a426fd8d5 drbd: Keep the listening socket open while trying to connect to the peer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner
1f3e509b76 drbd: pull prepare_listen_socket() out of drbd_wait_for_connect()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner
7e0f096b8d drbd: Remove drbd_accept() and use kernel_accept() instead
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:30 +01:00
Philipp Reisner
2820fd3969 drbd: Move the call to listen() out of drbd_accept()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Lars Ellenberg
1882e22df7 drbd: grammar fix in log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Philipp Reisner
3ea35df83f drbd: fix spelling, remove boring development log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Philipp Reisner
e4bad1bcac drbd: Ensure that data_size is not 0 before using data_size-1 as index
This could be exploited by a peer which runs modified code.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner
a1096a6e9d drbd: Delay/reject other state changes while establishing a connection
Changes to the role and disk state should be delayed or rejected
while we establish a connection.

This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.

The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner
27eb13e99b drbd: Fixed processing of disk-barrier, disk-flushes and disk-drain
Since drbd_bump_write_ordering() is called in the attaching
process while the disk state is D_ATTACHING, it was not
considering these three flags during attach.

A call to this function was missing form drbd_adm_disk_opts().

Fixed both issues.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg
9ed57dcbda drbd: ignore volume number for drbd barrier packet exchange
Transfer log epochs, and therefore P_BARRIER packets,
are per resource, not per volume.
We must not associate them with "some random volume".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg
2312f0b3c5 drbd: fix potential deadlock during "restart" of conflicting writes
w_restart_write(), run from worker context, calls __drbd_make_request()
and further drbd_al_begin_io(, delegate=true), which then
potentially deadlocks.  The previous patch moved a BUG_ON to expose
such call paths, which would now be triggered.

Also, if we call __drbd_make_request() from resource worker context,
like w_restart_write() did, and that should block for whatever reason
(!drbd_state_is_stable(), resource suspended, ...),
we potentially deadlock the whole resource, as the worker
is needed for state changes and other things.

Create a dedicated retry workqueue for this instead.

Also make sure that inc_ap_bio()/dec_ap_bio() are properly paired,
even if do_retry() needs to retry itself,
in case __drbd_make_request() returns != 0.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg
d93f63028f drbd: Fix a potential write ordering issue on SyncTarget nodes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:18 +01:00
Philipp Reisner
dd9b360475 drbd: Fix module refcount leak in drbd_accept()
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.

Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()

Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:16 +01:00
Philipp Reisner
4d0fc3fdc3 drbd: Fixed compat issue with disconnecting 8.4 from a primary 8.3
For compatibility reasons 8.4 has to send P_STATE_CHG_REQ (instead
of P_CONN_ST_CHG_REQ) when disconnecting.

In the receiving code path we missed to convert the old
answer (P_STATE_CHG_REPLY) back to 8.4 logic. Therefore
the CL_ST_CHG_SUCCESS or CL_ST_CHG_FAIL bit in the flags word
of mdev got set, while the state code was waiting for
the CONN_WD_ST_CHG_OKAY or CONN_WD_ST_CHG_FAIL bits in tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:14 +01:00
Philipp Reisner
7da358625c drbd: Restore late assigning of tconn->data.sock and meta.sock
With commit from Mon Mar 28 16:33:12 2011 +0200
"drbd: drbd_connect(): Initialize struct drbd_socket before sending anything"

tconn->data.sock and tconn->meta.sock get assigned early, in
conn_connect.

The early assigning can trigger an OOPS, because it may released the socket
without acquiring the mutex protecting the socket. An other thread (worker)
might use setsockopt() on the socket while it gets free()ed.

Restored the (proven) 8.3 behavior of assigning these sockets after the two
connections are established.

Credits for reporting the issue are going to Arne Redlich.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner
b8853dbd8c drbd: fix race between disconnect and receive_state
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.

The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.

Fixed in several code paths:
  Make sure the connection state is NetworkFailure or worse
  before starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

  Disallow receive_state() to "upgrade" the connection state
  from an error state. This will make sure the "illegal" state
  transition won't happen.

  For all connection failure states,
  relax the safe-guard in sanitize_state() again
  to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

 Note by Philipp Reisner:
  The 3rd chunk described as "relax the safe-guard..."
  is not there in 8.4 as it is relaxed to the maximum in
  8.4 already

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner
380207d08e drbd: Load balancing of read requests
New config option for the disk secition "read-balancing", with
the values: prefer-local, prefer-remote, round-robin, when-congested-remote.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Philipp Reisner
d10b4ea32b drbd: Get rid of "ASSERTION FAILED: tconn->current_epoch->list not empty"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Lars Ellenberg
615e087fbd drbd: add missing rcu locks around recently introduced idr_for_each
Recent commit
 drbd: Move write_ordering from mdev to tconn
introduced a new idr_for_each loop over all volumes,
but did not take necessary rcu locks or krefs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Philipp Reisner
77fede5137 drbd: Fix the WO=drain implementation for multiple volumes
Wait until IO is drained in all volumes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner
1e9dd2912e drbd: Switch drbd_may_finish_epoch() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner
12038a3a71 drbd: Move list of epochs from mdev to tconn
This is necessary since the transfer_log on the sending is also
per tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner
1d2783d532 drbd: Prepare epochs per connection
An epoch object needs a pointer to the mdev it was received for.
This is necessary to be able to send the barrier ack packet for
the same volume as the original barrier packet was assigned to.

This prepares the next step, in which the (receiver side)
epoch list is moved from the device (mdev) to the connection (tconn)
object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner
4b0007c0e8 drbd: Move write_ordering from mdev to tconn
This is necessary in order to prepare the move of the (receiver side)
epoch list from the device (mdev) to the connection (tconn) objects.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner
36baf6117b drbd: Fixed an obvious copy-n-paste mistake
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner
43de7c852b drbd: Fixes from the drbd-8.3 branch
* drbd-8.3:
  drbd: O_SYNC gives EIO on ramdisks for some kernels (eg. RHEL6).
  drbd: send intermediate state change results to the peer

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner
376694a054 drbd: Silenced compiler warnings
Since version 4.6.1 gcc warns about variables that get
a value assigned, but which are never read later on.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:05 +01:00
Philipp Reisner
9bcd252182 drbd: fix "stalled" empty resync
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.

Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Philipp Reisner
08b165ba11 drbd: Consider the discard-my-data flag for all volumes [bugz 359]
...not only for the first volume

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Philipp Reisner
85d735138a drbd: Cleanup all epoch objects upon connection loss
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:01 +01:00
Philipp Reisner
823bd832a6 drbd: Bugfix for the connection behavior
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting.  But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING.  The
restart of drbd_init() then resets thi->t_state to RUNNING.

I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.

The two parts of the fix:
  * Do not cause the thread to restart if we detect the issue
    with the sockets while we are in C_WF_CONNECTION.

  * Make sure that all actions that would have set us to C_BROKEN_PIPE
    happen before the state change to C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher
7d4c782cbd drbd: Fix the data-integrity-alg setting
The last data-integrity-alg fix made data integrity checking work when the
algorithm was changed for an established connection, but the common case of
configuring the algorithm before connecting was still broken.  Fix that.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher
accdbcc5f9 drbd: receive_protocol(): We cannot change our own data-integrity-alg setting here
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher
d505d9bef2 drbd: Be consistent in reporting incompatibilities in P_PROTOCOL settings
Refer to the settings by the names which drbdsetup and drbd.conf are using.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher
fbc12f4514 drbd: receive_protocol(): Make the program flow less confusing
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher
b792c35cfb drbd: receive_protocol(): Give variables more easily searchable names
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher
5af172ed9e drbd: Print memory address in hex instead of decimal in error message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Philipp Reisner
369bea6371 drbd: Fixed removal of volumes/devices from connected resources
When removing a volume/device we need to switch the connection
status of the peer back into WFReportParams.

  Before this fix it was left in Connected state. That means that
  the peer device continued to inform us about state changes, etc...
  But we deleted that minor -> protocol error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:51 +01:00
Philipp Reisner
cdfda633d2 drbd: detach from frozen backing device
* drbd-8.3:
  documentation: Documented detach's --force and disk's --disk-timeout
  drbd: Implemented the disk-timeout option
  drbd: Force flag for the detach operation
  drbd: Allow new IOs while the local disk in in FAILED state
  drbd: Bitmap IO functions can not return prematurely if the disk breaks
  drbd: Added a kref to bm_aio_ctx
  drbd: Hold a reference to ldev while doing meta-data IO
  drbd: Keep a reference to the bio until the completion handler finished
  drbd: Implemented wait_until_done_or_disk_failure()
  drbd: Replaced md_io_mutex by an atomic: md_io_in_use
  drbd: moved md_io into mdev
  drbd: Immediately allow completion of IOs, that wait for IO completions on a failed disk
  drbd: Keep a reference to barrier acked requests

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Andreas Gruenbacher
2fcb8f307f drbd: Improve the "unexpected packet" error messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00