While the function reading an object from the complete OID already
verifies OIDs, we do not yet do so for reading objects from a partial
OID. Do so when strict OID verification is enabled.
The read_prefix_1 function has several return statements springled
throughout the code. As we have to free memory upon getting an error,
the free code has to be repeated at every single retrun -- which it is
not, so we have a memory leak here.
Refactor the code to use the typical `goto out` pattern, which will free
data when an error has occurred. While we're at it, we can also improve
the error message thrown when multiple ambiguous prefixes are found. It
will now include the colliding prefixes.
Verifying hashsums of objects we are reading from the ODB may be costly
as we have to perform an additional hashsum calculation on the object.
Especially when reading large objects, the penalty can be as high as
35%, as can be seen when executing the equivalent of `git cat-file` with
and without verification enabled. To mitigate for this, we add a global
option for libgit2 which enables the developer to turn off the
verification, e.g. when he can be reasonably sure that the objects on
disk won't be corrupted.
The upstream git.git project verifies objects when looking them up from
disk. This avoids scenarios where objects have somehow become corrupt on
disk, e.g. due to hardware failures or bit flips. While our mantra is
usually to follow upstream behavior, we do not do so in this case, as we
never check hashes of objects we have just read from disk.
To fix this, we create a new error class `GIT_EMISMATCH` which denotes
that we have looked up an object with a hashsum mismatch. `odb_read_1`
will then, after having read the object from its backend, hash the
object and compare the resulting hash to the expected hash. If hashes do
not match, it will return an error.
This obviously introduces another computation of checksums and could
potentially impact performance. Note though that we usually perform I/O
operations directly before doing this computation, and as such the
actual overhead should be drowned out by I/O. Running our test suite
seems to confirm this guess. On a Linux system with best-of-five
timings, we had 21.592s with the check enabled and 21.590s with the
ckeck disabled. Note though that our test suite mostly contains very
small blobs only. It is expected that repositories with bigger blobs may
notice an increased hit by this check.
In addition to a new test, we also had to change the
odb::backend::nonrefreshing test suite, which now triggers a hashsum
mismatch when looking up the commit "deadbeef...". This is expected, as
the fake backend allocated inside of the test will return an empty
object for the OID "deadbeef...", which will obviously not hash back to
"deadbeef..." again. We can simply adjust the hash to equal the hash of
the empty object here to fix this test.
Error messages should be sentence fragments, and therefore:
1. Should not begin with a capital letter,
2. Should not conclude with punctuation, and
3. Should not end a sentence and begin a new one
Only provide the empty tree internally, which matches git's behavior.
If we provide the empty blob then any users trying to write it with
libgit2 would omit it from actually landing in the odb, which appear
to git proper as a broken repository (missing that object).
Move the delta application functions into `delta.c`, next to the
similar delta creation functions. Make the `git__delta_apply`
functions adhere to other naming and parameter style within the
library.
The old implementation had two issues:
1. OIDs that were too short as to be ambiguous were not being handled
properly.
2. If the last OID to expand in the array was missing from the ODB, we
would leak a `GIT_ENOTFOUND` error code from the function.
For most real use cases, repositories with alternates use them as main
object storage. Checking the alternate for objects before the main
repository should result in measurable speedups.
Because of this, we're changing the sorting algorithm to prioritize
alternates *in cases where two backends have the same priority*. This
means that the pack backend for the alternate will be checked before the
pack backend for the main repository *but* both of them will be checked
before any loose backends.
In the current implementation of ODB backends, each backend is tasked
with refreshing itself after a failed lookup. This is standard Git
behavior: we want to e.g. reload the packfiles on disk in case they have
changed and that's the reason we can't find the object we're looking
for.
This behavior, however, becomes pathological in repositories where
multiple alternates have been loaded. Given that each alternate counts
as a separate backend, a miss in the main repository (which can
potentially be very frequent in cases where object storage comes from
the alternate) will result in refreshing all its packfiles before we
move on to the alternate backend where the object will most likely be
found.
To fix this, the code in `odb.c` has been refactored as to perform the
refresh of all the backends externally, once we've verified that the
object is nowhere to be found.
If the refresh is successful, we then perform the lookup sequentially
through all the backends, skipping the ones that we know for sure
weren't refreshed (because they have no refresh API).
The on-disk pack backend has been adjusted accordingly: it no longer
performs refreshes internally.
As refdb and odb backends can be allocated by client code, libgit2
can’t know whether an alternative memory allocator was used, and thus
should not try to call `git__free` on those objects.
Instead, odb and refdb backend implementations must always provide
their own `free` functions to ensure memory gets freed correctly.
We currently first look in the loose object dir and then in the packs
for objects. When performing operations on recent history this has a
higher likelihood of hitting, but when we deal with operations which
look further back into the past, we start spending a large amount of
time getting ENOTENT from `access`.
Reversing the priorities means that long-running operations can get to
their objects faster, as we can look at the index data we have in memory
(or rather mapped) to figure out whether we have an object, which is
faster than going out to the filesystem.
The packed backend already implements an optimistic read algorithm by
first looking at the packs we know about and only going out to disk to
referesh if the object is not found which means that in the case where
we do have the object (which will be in the majority for anything that
traverses the graph) we can avoid going to to disk entirely to determine
whether an object exists.
Operations which look at recent history may take a slight impact, but
these would be operations which look a lot less at object and thus take
less time regardless.
Restricting files to size_t is a silly limitation. The loose backend
writes to a file directly, so there is no issue in using 63 bits for the
size.
We still assume that the header is going to fit in 64 bytes, which does
mean quite a bit smaller files due to the run-length encoding, but it's
still a much larger size than you would want Git to handle.
Make our overflow checking look more like gcc and clang's, so that
we can substitute it out with the compiler instrinsics on platforms
that support it. This means dropping the ability to pass `NULL` as
an out parameter.
As a result, the macros also get updated to reflect this as well.
This is a contract that we made in the library and that we need to uphold. The
contents of a blob can never be NULL because several parts of the library (including
the filter and attributes code) expect `git_blob_rawcontent` to always return a
valid pointer.
git hardocodes these as objects which exist regardless of whether they
are in the odb and uses them in the shell interface as a way of
expressing the lack of a blob or tree for one side of e.g. a diff.
In the library we use each language's natural way of declaring a lack of
value which makes a workaround like this unnecessary. Since git uses it,
it does however mean each shell application would need to perform this
check themselves.
This makes it common work across a range of applications and an issue
with compatibility with git, which fits right into what the library aims
to provide.
Thus we introduce the hard-coded empty blob and tree in the odb
frontend. These hard-coded objects are checked for before going to the
backends, but after the cache check, which means the second time they're
used, they will be treated as normal cached objects instead of creating
new ones.
The git_odb_exists_prefix API was not dealing correctly when a
later backend returned GIT_ENOTFOUND even if an earlier backend
had found the object.
Additionally, the unit tests were not properly exercising the API
and had a couple mistakes in checking the results.
Lastly, since the backends are not expected to behavior correctly
unless all bytes of the short id are zero except for the prefix,
this makes the ODB prefix APIs explicitly clear out the extra
bytes so the user doesn't have to be as careful.