Verifying hashsums of objects we are reading from the ODB may be costly
as we have to perform an additional hashsum calculation on the object.
Especially when reading large objects, the penalty can be as high as
35%, as can be seen when executing the equivalent of `git cat-file` with
and without verification enabled. To mitigate this, we add a global
option for libgit2 which enables developers to turn off the
verification, e.g. when they can be reasonably sure that the objects on
disk won't be corrupted.
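As a rough sketch of how such a switch might be flipped: the following
assumes the option is exposed through `git_libgit2_opts` under a name
like `GIT_OPT_ENABLE_STRICT_HASH_VERIFICATION`; the option name is
illustrative, as this text does not spell it out.

    #include <git2.h>

    int main(void)
    {
        git_libgit2_init();

        /*
         * Illustrative option name: turn off ODB hash verification
         * when the objects on disk are trusted, trading safety for
         * read performance.
         */
        git_libgit2_opts(GIT_OPT_ENABLE_STRICT_HASH_VERIFICATION, 0);

        /* ... object reads skip the extra hashing pass here ... */

        git_libgit2_shutdown();
        return 0;
    }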
The upstream git.git project verifies objects when looking them up from
disk. This avoids scenarios where objects have somehow become corrupt on
disk, e.g. due to hardware failures or bit flips. While our mantra is
usually to follow upstream behavior, we do not do so in this case, as we
never check hashes of objects we have just read from disk.
To fix this, we create a new error code `GIT_EMISMATCH` which denotes
that we have looked up an object with a hashsum mismatch. `odb_read_1`
will then, after having read the object from its backend, hash the
object and compare the resulting hash to the expected hash. If hashes do
not match, it will return an error.
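A sketch of what the check boils down to; the helper name and error
message are illustrative, while `git_odb_hash`, `git_oid_cmp` and the
internal error setter `giterr_set` are existing API and `GIT_EMISMATCH`
is the code introduced here.

    /*
     * After the backend has produced the raw object, re-hash the data
     * and compare it against the OID the caller asked for.
     */
    static int check_object_hash(const git_oid *expected,
        const void *data, size_t len, git_otype type)
    {
        git_oid actual;
        int error;

        if ((error = git_odb_hash(&actual, data, len, type)) < 0)
            return error;

        if (git_oid_cmp(expected, &actual) != 0) {
            giterr_set(GITERR_ODB, "hashsum mismatch in object database");
            return GIT_EMISMATCH;
        }

        return 0;
    }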
This obviously introduces another computation of checksums and could
potentially impact performance. Note though that we usually perform I/O
operations directly before doing this computation, and as such the
actual overhead should be drowned out by I/O. Running our test suite
seems to confirm this guess: on a Linux system with best-of-five
timings, we measured 21.592s with the check enabled and 21.590s with
the check disabled. Note though that our test suite mostly contains
very small blobs; repositories with bigger blobs may see a more
noticeable hit from this check.
In addition to a new test, we also had to change the
odb::backend::nonrefreshing test suite, which now triggers a hashsum
mismatch when looking up the commit "deadbeef...". This is expected, as
the fake backend allocated inside of the test will return an empty
object for the OID "deadbeef...", which will obviously not hash back to
"deadbeef..." again. We can simply adjust the hash to equal the hash of
the empty object here to fix this test.
We currently have no tests which check whether we fail reading corrupted
objects. Add one which modifies the contents of an object stored on
disk and then tries to read the object.
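A sketch of what such a test could look like with libgit2's clar
helpers; the fixture variable `g_repo`, the object path and the OID
are illustrative.

    git_oid oid;
    git_object *obj;
    FILE *fh;

    cl_git_pass(git_oid_fromstr(&oid,
        "1385f264afb75a56a5bec74243be9b367ba4ca08"));

    /* Flip a byte in the loose object's on-disk representation. */
    fh = fopen("testrepo.git/objects/13/85f264afb75a56a5bec74243be9b367ba4ca08", "r+b");
    cl_assert(fh != NULL);
    cl_assert(fseek(fh, -1, SEEK_END) == 0);
    cl_assert(fputc('X', fh) != EOF);
    fclose(fh);

    /* Reading the corrupted object back should now fail. */
    cl_git_fail(git_object_lookup(&obj, g_repo, &oid, GIT_OBJ_ANY));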
The object::lookup tests use the "testrepo.git" repository in a
read-only way, so we do not set up the repository as a sandbox but
simply open it. In a future commit, though, we will want to test
looking up objects which are corrupted in some way, which requires us
to modify the on-disk data. Doing so without a sandbox would modify
the contents of our own libgit2 repository.
Create the repository in a sandbox to avoid this.
In the odb::backend::nonrefreshing test suite, we set up a fake backend
so that we are able to determine if backend functions are called
correctly. During the setup, we also parse an OID which is later used
to read out the pseudo-object. While this procedure works right now, it
will create problems later when we implement hash verification for
looked up objects. The current OID ("deadbeef") will not match the hash
of contents we give back to the ODB layer and thus cannot be verified.
Make the hash configurable so that we can simply switch the returned
OID for individual tests.
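A sketch of how the setup might be parametrized; all names here are
illustrative, not the test suite's actual ones.

    #include <stdlib.h>
    #include <git2.h>
    #include <git2/sys/odb_backend.h>

    typedef struct {
        git_odb_backend parent;
        git_oid oid;   /* the single object this backend "contains" */
    } fake_backend;

    static int build_fake_backend(git_odb_backend **out, const char *hash)
    {
        fake_backend *fake = calloc(1, sizeof(*fake));

        if (fake == NULL)
            return -1;

        fake->parent.version = GIT_ODB_BACKEND_VERSION;
        /* ... wire up the read/exists callbacks here ... */

        if (git_oid_fromstr(&fake->oid, hash) < 0) {
            free(fake);
            return -1;
        }

        *out = &fake->parent;
        return 0;
    }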
The threads::diff test suite has a static variable `_retries`, which is
used on Windows platforms only. As it is unused on other systems, the
compiler throws a warning there. Fix the warning by wrapping the
declaration in an ifdef.
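The fix itself is tiny; assuming the variable is an `int` and using
libgit2's Windows guard macro, it amounts to:

    #ifdef GIT_WIN32
    static int _retries;
    #endif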
In the function `git_filter_list_stream_data`, we initialize, write to
and subsequently close the stream which should receive content
processed by the filter. While we skip writing to the stream if its
initialization failed, we still try to close it unconditionally -- even
though after a failed initialization the stream might not be set at
all, leading us to segfault.
The semantics of this code are not really clear. The function handling
the same logic for files instead of data seems to do the right thing
here in only closing the stream when initialization succeeded. Stepping
back a bit, this is the only reasonable contract: if a stream cannot be
initialized, the caller would not expect it to be closed again. As it
stands, both callers of `stream_list_init` get this wrong in one way or
another: the data streaming function will always close the stream, and
the file streaming function will not close the stream if writing to it
has failed.
The fix is thus two-fold:
- callers of `stream_list_init` now close the stream iff it has been
initialized
- `stream_list_init` now closes the lastly initialized stream if
the current stream in the chain failed to initialize
Add a test which segfaulted prior to these changes.
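An illustrative sketch of the corrected caller logic, using made-up
types and signatures rather than libgit2's actual ones:

    #include <stddef.h>

    typedef struct stream stream_t;
    extern int stream_list_init(stream_t **out);
    extern int stream_write(stream_t *s, const char *data, size_t len);
    extern void stream_close(stream_t *s);

    int stream_data(const char *data, size_t len)
    {
        stream_t *s = NULL;
        int error, initialized = 0;

        if ((error = stream_list_init(&s)) == 0) {
            initialized = 1;
            error = stream_write(s, data, len);
        }

        /*
         * Previously the stream was closed unconditionally here,
         * dereferencing an unset stream after a failed init.
         */
        if (initialized)
            stream_close(s);

        return error;
    }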
Whenever we rename a branch, we update the repository's symbolic HEAD
reference if it currently points to the branch that is to be renamed.
But with the introduction of worktrees, we also have to iterate over all
HEADs of linked worktrees to adjust them. Do so.
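A sketch of the added iteration; error handling is abbreviated, the
helper name is illustrative, and the worktree functions are the ones
introduced by this series.

    #include <string.h>
    #include <git2.h>

    static int retarget_worktree_heads(git_repository *repo,
        const char *old_ref, const char *new_ref)
    {
        git_strarray names = {0};
        size_t i;
        int error;

        if ((error = git_worktree_list(&names, repo)) < 0)
            return error;

        for (i = 0; i < names.count; i++) {
            git_worktree *wt = NULL;
            git_repository *wt_repo = NULL;
            git_reference *head = NULL;

            if (git_worktree_lookup(&wt, repo, names.strings[i]) < 0 ||
                git_repository_open_from_worktree(&wt_repo, wt) < 0)
                goto next;

            /* Repoint HEAD iff it is symbolic and targets the old branch. */
            if (git_reference_lookup(&head, wt_repo, "HEAD") == 0 &&
                git_reference_type(head) == GIT_REF_SYMBOLIC &&
                strcmp(git_reference_symbolic_target(head), old_ref) == 0)
                error = git_repository_set_head(wt_repo, new_ref);

    next:
            git_reference_free(head);
            git_repository_free(wt_repo);
            git_worktree_free(wt);
        }

        git_strarray_free(&names);
        return error;
    }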
POSIX emulation retries should be configurable so that tests can disable
them. In particular, aggressively multithreaded tests may end up
trying to open locked files and needing retries, which would slow
continuous integration runs significantly.
It is possible to specify submodule URLs relative to the repository
location. E.g. having a submodule with URL "../submodule" will look for
the submodule at "repo/../submodule".
With the introduction of worktrees, though, we cannot simply resolve the
URL relative to the repository location itself. If the repository for
which a URL is to be resolved is a working tree, we have to resolve the
URL relative to the parent's repository path. Otherwise, the URL would
change depending on where the working tree is located.
Fix this by special-casing working trees when computing the URL base.
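A sketch of the special case; the helper name is illustrative, while
`git_repository_is_worktree` and `git_repository_commondir` are the
accessors this series makes available.

    #include <git2.h>

    static const char *get_url_base(git_repository *repo)
    {
        /*
         * For a linked working tree, resolve relative to the parent
         * repository's location (the commondir) instead of the
         * worktree's own path.
         */
        if (git_repository_is_worktree(repo))
            return git_repository_commondir(repo);

        return git_repository_workdir(repo) != NULL ?
            git_repository_workdir(repo) :
            git_repository_path(repo);
    }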
References for a repository are usually created inside of its gitdir.
When using worktrees, though, these references are not to be created
inside the worktree gitdir, but instead inside the gitdir of its parent
repository, which is the commondir. This way, branches will still be
available after the worktree itself has been deleted.
The filesystem refdb currently still creates new references inside of
the gitdir. Fix this and have it create references in commondir.
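A sketch of the corrected path construction; the helper name is
illustrative and `git_buf_joinpath` is libgit2's internal path helper.

    /*
     * Build the on-disk path for a new loose reference from the
     * common directory rather than the gitdir. For non-worktree
     * repositories the commondir is the gitdir anyway.
     */
    static int loose_ref_path(git_buf *out, git_repository *repo,
        const char *refname)
    {
        const char *base = git_repository_commondir(repo);

        if (base == NULL)
            base = git_repository_path(repo);

        return git_buf_joinpath(out, base, refname);
    }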
The three link files "worktree/.git", ".git/worktrees/<name>/commondir"
and ".git/worktrees/<name>/gitdir" should always contain absolute and
resolved paths. Adjust the logic creating new worktrees to run the
paths through `git_path_prettify_dir` before writing out these files,
so that they are absolute and resolved.
The working tree's parent path should not point to the parent's
gitdir, but to the parent's working directory. Pointing to the gitdir
would not make much sense, as the parent's gitdir is equal to both
repositories' common directory anyway.
Fix the issue.
While we already provide functionality to open a repository from a
worktree, we cannot do so the other way round: given a repository, we
want to look up its worktree, if one actually exists.
Getting the worktree of a repository is useful when we want to access
meta information like the parent's location or the worktree's locked
status.
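A sketch of the new lookup direction; the function name follows the
series' API style and is an assumption here.

    #include <stdio.h>
    #include <git2.h>

    static void print_worktree_lock_state(git_repository *repo)
    {
        git_worktree *wt = NULL;

        if (git_worktree_open_from_repository(&wt, repo) < 0)
            return;

        /* A positive return value indicates the worktree is locked. */
        printf("locked: %d\n", git_worktree_is_locked(NULL, wt) > 0);
        git_worktree_free(wt);
    }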
The function `diff_parsed_alloc` allocates and initializes a
`git_diff_parsed` structure. This structure also contains diff options.
While we initialize its flags, we fail to do a real initialization of
its values. This bites us when we want to actually use the generated
diff, as we do not set the options' version field, which is required to
operate correctly.
Fix the issue by executing `git_diff_init_options` on the embedded
struct.
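Minimal usage of the initializer at the heart of the fix: it sets the
version field (and sane defaults), which zeroing the struct alone does
not.

    #include <git2.h>

    int init_opts_example(void)
    {
        git_diff_options opts;

        return git_diff_init_options(&opts, GIT_DIFF_OPTIONS_VERSION);
    }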
In a diff, the shortest possible hunk with a modification (as opposed
to a pure deletion) results from a file with a single line containing
only a single character, which is removed. Thus the following hunk
@@ -1 +1 @@
-a
+
is the shortest valid hunk modifying a line. The function parsing the
hunk body, though, assumes that there must always be at least 4 bytes
present to make up a valid hunk, which is obviously wrong in this case.
The absolute minimum number of bytes required for a modification is
actually 2 bytes, that is the "+" and the following newline. Note: if
there is no trailing newline, the assumption will not be violated, as
the diff will have a line "\ No newline at end of file" at its end.
This patch fixes the issue by lowering the number of bytes required.
The current logic of `git_diff_foreach` makes the assumption that all
diffs passed in are actually derived from generated diffs. With these
assumptions we try to derive the actual diff by inspecting either the
working directory files or blobs of a repository. This obviously cannot
work for diffs parsed from a file, where we do not necessarily have a
repository at hand.
Since the introduced split between parsed and generated patches, there
are multiple functions which help us handle patches generically,
regardless of where they stem from. Use these functions and remove the
old logic specific to generated patches. This allows re-using the same
code for invoking the callbacks on the deltas.
An untracked file in a submodule should not prevent a rebase from
starting. Even if the submodule's SHA is changed, and that file would
conflict with a new tracked file, it's still OK to start the rebase
and discover the conflict later.
Signed-off-by: David Turner <dturner@twosigma.com>
When fsync'ing files, fsync the parent directory in the case where we
rename a file into place, or create a new file, to ensure that the
directory entry is flushed correctly.
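A sketch of the technique on POSIX systems; the helper name is
illustrative.

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * After renaming a file into place (or creating one), fsync the
     * containing directory so the directory entry itself reaches
     * stable storage.
     */
    static int fsync_parent_dir(const char *dir_path)
    {
        int fd, error = 0;

        if ((fd = open(dir_path, O_RDONLY)) < 0)
            return -1;

        if (fsync(fd) < 0)
            error = -1;

        close(fd);
        return error;
    }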