Error messages should be sentence fragments, and therefore:
1. Should not begin with a capital letter,
2. Should not conclude with punctuation, and
3. Should not end a sentence and begin a new one
The function to write trees allocates a new buffer for each tree.
This causes problems with performance when performing a lot
of actions involving writing trees, e.g. when doing many merges.
Fix the issue by instead handing in a shared buffer, which is then
re-used across the calls without having to re-allocate between
calls.
We look at whether we're trying to replace a blob with a tree during the
update phase, but we fail to look at whether we've just inserted a blob
where we're now trying to insert a tree.
Update the check to look at both places. The test for this was
previously succeeding due to the bu where we did not look at the sorted
output.
When parsing tree entries from raw object data, we do not verify
that the tree entry actually has a filename as well as a valid
object ID. Fix this by asserting that the filename length is
non-zero as well as asserting that there are at least
`GIT_OID_RAWSZ` bytes left when parsing the OID.
When we want to remove the file, use the basename as the name of the
entry to remove, instead of the full one, which includes the directories
we've inserted into the stack.
Instead of going through the usual steps of reading a tree recursively
into an index, modifying it and writing it back out as a tree, introduce
a function to perform simple updates more efficiently.
`git_tree_create_updated` avoids reading trees which are not modified
and supports upsert and delete operations. It is not as versatile as
modifying the index, but it makes some common operations much more
efficient.
Take advantage of the constant size of tree-owned arrays and store them
in an array instead of a pool. This still lets us free them all at once
but lets the system allocator do the work of fitting them in.
Submodules don't exist in the objectdb and the code is making us try to
look for a blob with its commit id, which is obviously not going to
work.
Skip the test if the user wants to insert a submodule.
When duplicating a `struct git_tree_entry` with
`git_tree_entry_dup` the resulting structure is not allocated
inside a memory pool. As we do a 1:1 copy of the original struct,
though, we also copy the `pooled` field, which is set to `true`
for pooled entries. This results in a huge memory leak as we
never free tree entries that were duplicated from a pooled
tree entry.
Fix this by marking the newly duplicated entry as un-pooled.
This reduces the size of the struct from 32 to 26 bytes, and leaves a
single padding byte at the end of the struct (which comes from the
zero-length array).
These are rather small allocations, so we end up spending a non-trivial
amount of time asking the OS for memory. Since these entries are tied to
the lifetime of their tree, we can give the tree a pool so we speed up
the allocations.
We've already looked at the filename with `memchr()` and then used
`strlen()` to allocate the entry. We already know how much we have to
advance to get to the object id, so add the filename length instead of
looking at each byte again.
Make our overflow checking look more like gcc and clang's, so that
we can substitute it out with the compiler instrinsics on platforms
that support it. This means dropping the ability to pass `NULL` as
an out parameter.
As a result, the macros also get updated to reflect this as well.
Path validation may be influenced by `core.protectHFS` and
`core.protectNTFS` configuration settings, thus treebuilders
can take a repository to influence their configuration.
An obvious place to fill the tree cache is on write-tree, as we're
guaranteed to be able to fill in the whole tree cache.
The way this commit does this is not the most efficient, as we read the
root tree from the odb instead of filling in the cache as we go along,
but it fills the cache such that successive operations (and persisting
the index to disk) will be able to take advantage of the cache, and it
reuses the code we already have for filling the cache.
Filling in the cache as we create the trees would require some
reallocation of the children vector, which is currently not possible
with out pool implementation. A different data structure would likely
allow us to perform this operation at a later date.
Keeping the cache around after read-tree is only one part of the
optimisation opportunities. In order to share the cache between program
instances, we need to write the TREE extension to the index.
Do so, taking the opportunity to rename 'entries' to 'entry_count' to
match the name given in the format description. The included test is
rather trivial, but works as a sanity check.
If the user wants to keep a copy for themselves, they should make a
copy. It adds unnecessary complexity to make sure the returned entries
are valid until the builder is cleared.
Finding a filename in a vector means we need to resort it every time we
want to read from it, which includes every time we want to write to it
as well, as we want to find duplicate keys.
A hash-map fits what we want to do much more accurately, as we do not
care about sorting, but just the particular filename.
We still keep removed entries around, as the interface let you assume
they were going to be around until the treebuilder is cleared or freed,
but in this case that involves an append to a vector in the filter case,
which can now fail.
The only time we care about sorting is when we write out the tree, so
let's make that the only time we do any sorting.
This updates the git_pqueue to simply be a set of specialized
init/insert/pop functions on a git_vector.
To preserve the pqueue feature of having a fixed size heap, I
converted the "sorted" field in git_vectors to a more general
"flags" field so that pqueue could mix in it's own flag. This
had a bunch of ramifications because a number of places were
directly looking at the vector "sorted" field - I added a couple
new git_vector helpers (is_sorted, set_sorted) so the specific
representation of this information could be abstracted.
This changes the behavior of callbacks so that the callback error
code is not converted into GIT_EUSER and instead we propagate the
return value through to the caller. Instead of using the
giterr_capture and giterr_restore functions, we now rely on all
functions to pass back the return value from a callback.
To avoid having a return value with no error message, the user
can call the public giterr_set_str or some such function to set
an error message. There is a new helper 'giterr_set_callback'
that functions can invoke after making a callback which ensures
that some error message was set in case the callback did not set
one.
In places where the sign of the callback return value is
meaningful (e.g. positive to skip, negative to abort), only the
negative values are returned back to the caller, obviously, since
the other values allow for continuing the loop.
The hardest parts of this were in the checkout code where positive
return values were overloaded as meaningful values for checkout.
I fixed this by adding an output parameter to many of the internal
checkout functions and removing the overload. This added some
code, but it is probably a better implementation.
There is some funkiness in the network code where user provided
callbacks could be returning a positive or a negative value and
we want to rely on that to cancel the loop. There are still a
couple places where an user error might get turned into GIT_EUSER
there, I think, though none exercised by the tests.
This continues auditing all the places where GIT_EUSER is being
returned and making sure to clear any existing error using the
new giterr_user_cancel helper. As a result, places that relied
on intercepting GIT_EUSER but having the old error preserved also
needed to be cleaned up to correctly stash and then retrieve the
actual error.
Additionally, as I encountered places where error codes were not
being propagated correctly, I tried to fix them up. A number of
those fixes are included in the this commit as well.