This is the initial integration of the similarity metric into
the `git_diff_find_similar()` code path. The existing tests all
pass, but the new functionality isn't currently well tested. The
integration does go through the pluggable metric interface, so it
should be possible to drop in an alternative to the internal
metric that libgit2 implements.
This comes along with a behavior change for an existing interface;
namely, passing two NULLs to git_diff_blobs (or passing NULLs to
git_diff_blob_to_buffer) will now call the file_cb parameter zero
times instead of one time. I know it's strange that that change
is paired with this other change, but it emerged from some
initialization changes that I ended up making.
Previously the git_diff_delta recorded if the delta was binary.
This replaces that (with no net change in structure size) with
a full set of flags. The flag values that were already in use
for individual git_diff_file objects are reused for the delta
flags, too (along with renaming those flags to make it clear that
they are used more generally).
This (a) makes things somewhat more consistent (because I was
using a -1 value in the "boolean" binary field to indicate unset,
whereas now I can just use the flags that are easier to understand),
and (b) will make it easier for me to add some additional flags to
the delta object in the future, such as marking the results of a
copy/rename detection or other deltas that might want a special
indicator.
While making this change, I officially moved some of the flags that
were internal only into the private diff header.
This also allowed me to remove a gross hack in rename/copy detect
code where I was overwriting the status field with an internal
value.
This plugs in the three basic similarity strategies for handling
whitespace via internal use of the pluggable API. In so doing, I
realized that the use of git_buf in the hashsig API was not needed
and actually just made it harder to use, so I tweaked that API as
well.
Note that the similarity metric is still not hooked up in the
find_similarity code - this is just setting out the function that
will be used.
This moves the similarity metric code out of buf_text and into a
new file. Also, this implements a different approach to similarity
measurement based on a Rabin-Karp rolling hash where we only keep
the top 100 and bottom 100 hashes. In theory, that should be
sufficient samples to given a fairly accurate measurement while
limiting the amount of data we keep for file signatures no matter
how large the file is.
This makes the text similarity metric treat \r as equivalent
to \n and makes it skip whitespace immediately following a line
terminator, so line indentation will have less effect on the
difference measurement (and so \r\n will be treated as just a
single line terminator).
This also separates the text and binary hash calculators into
two separate functions instead of have more if statements inside
the loop. This should make it easier to have more differentiated
heuristics in the future if we so wish.
This adds a new `git_buf_text_hashsig` type and functions to
generate these hash signatures and compare them to give a
similarity score. This can be plugged into diff similarity
scoring.
This replaces most of the explicit vector iteration with calls
to git_vector_foreach, adds in some git__free and giterr_clear
calls to clean up during some error paths, and a couple of
other code simplifications.
The treebuilder entries vector flags removed items which means
we can't rely on the entries vector length to accurately get the
number of entries. This adds an entrycount value and maintains it
while updating the treebuilder entries.
The cppcheck static analyzer generates warnings for a bunch of
places in the libgit2 code base. All the ones fixed in this
commit are actually false positives, but I've reorganized the
code to hopefully make it easier for static analysis tools to
correctly understand the structure. I wouldn't do this if I
felt like it was making the code harder to read or worse for
humans, but in this case, these fixes don't seem too bad and will
hopefully make it easier for better analysis tools to get at any
real issues.
If gethostbyname() fails on platforms with NO_ADDRINFO, the code
leaks the struct addrinfo that was allocated. This fixes that
(and a number of code formatting issues in that area of code in
src/posix.c).
`git_diff_blobs` and `git_diff_blob_to_buffer` skip the step
where we check file attributes because they don't have a filename
associated with the data. Unfortunately, this meant they were also
skipping the check for the GIT_DIFF_FORCE_TEXT option and so you
could not force a diff of an apparent binary file. This adds the
force text check into their code path.
The callback will be called for each file, just before the `git_delta_t` gets inserted into the diff list.
When the callback:
- returns < 0, the diff process will be aborted
- returns > 0, the delta will not be inserted into the diff list, but the diff process continues
- returns 0, the delta is inserted into the diff list, and the diff process continues
Instead of returning directly the pattern as the return value, I used an
out parameter, because the function also tests if the passed pathspecs
vector is empty. If yes, it considers that the path "matches", but in
that case there is no matched pattern per se.
W/o this a libgit2 error message could have a mixed encoding:
e.g. a filename in UTF-8 combined with a native Windows error message
encoded with the local code page.
Signed-off-by: Sven Strickroth <email@cs-ware.de>
A leading slash confuses the name normalization code when the flags
include ALLOW_ONELEVEL. Catch this case in particular to avoid
triggering an assertion in the uppercase check which expects us not to
pass it an empty string.
The existing tests don't catch this as they simply use the NORMAL
flag.
This fixes#1300.
This adds a `git_diff_patch_line_stats()` API that gets the total
number of adds, deletes, and context lines in a patch. This will
make it a little easier to emulate `git diff --stat` and the like.
Right now, this relies on generating the `git_diff_patch` object,
which is a pretty heavyweight way to get stat information. At
some future point, it would probably be nice to be able to get
this information without allocating the entire `git_diff_patch`,
but that's a much larger project.
This is a new implementation of core git's config key checking
rules that prevents non-alphanumeric characters (and '-') for
the top-level section and key names inside of config files.
This also validates the target section name when renaming
sections.
OpenBSD's realpath(3) doesn't require the last part of the path to
exist. Override p_realpath in this OS to bring it in line with the
library's assumptions.
Check whether the backslash at the end of the line is being escaped or
not so as not to consider it a continuation marker when it's e.g. a
Windows-style path.
This is a convenience function to get the branch name of a given
ref. The returned branch name is compatible with the name that can
be supplied e.g. to git_branch_lookup(). That is, the prefixes
"refs/heads" or "refs/remotes" are omitted.
Also added a new test for testing the new function.
With the new code to make tree iterators support ignore_case,
there is a bug in setting the start entry for range bounded
iterators where memcmp was being used instead of strncasecmp.
This fixes that and expands the tree iterator test to cover
the cases that were broken.
The commit time is already stored as a git_time_t, but we were
parsing is as a uint32_t. This just switches the parser to use
uint64_t which will handle dates further in the future (and adds
some tests of those future dates).
When the encoding header changed to be treated as an additional
header, the EOL pointer started to point to the byte after the LF,
making the git__strndup call copy the LF into the value.
Increase the EOL pointer value after copying the data to keep the rest
of the semantics but avoid copying LF.
This moves the check for the "encoding" header into a loop which
is just scanning for non-required headers at the end of a commit
header. That loop will skip unrecognized lines (including header
continuation lines) until a terminating completely blank line is
found, and only then does it move to reading the commit message.
This makes tree iterators directly support case insensitivity by
using a secondary index that can be sorted by icase. Also, this
fixes the ambiguity check in the git_status_file API to also be
case insensitive. Lastly, this adds new test cases for case
insensitive range boundary checking for all types of iterators.
With this change, it should be possible to deprecate the spool
and sort iterator, but I haven't done that yet.
This adds a new external API git_tree_entry_cmp and a new internal
API git_tree_entry_icmp for sorting tree entries. The case
insensitive one is internal only because general users should
never be seeing case-insensitively sorted trees.
git__bsearch and git__tsort did not pass a payload through to the
comparison function. This makes it impossible to implement sorted
lists where the sort order depends on external data (e.g. building
a secondary sort order for the entries in a tree). This commit
adds git__bsearch_r and git__tsort_r versions that pass a third
parameter to the cmp function of a user payload.
This changes the iterator API so that flags can be passed in to
the constructor functions to control the ignore_case behavior.
At this point, the flags are not supported on tree iterators (i.e.
there is no functional change over the old API), but the API
changes are all made to accomodate this.
By the way, I went with a flags parameter because in the future
I have a couple of other ideas for iterator flags that will make
it easier to fix some diff/status/checkout bugs.
Returning GIT_EAMBIGUOUS from inside the status callback gets
overridden with GIT_EUSER. `git_status_file` accounted for this
via the callback payload, but was allowing the error message to
be cleared. Move the `giterr_set` call outside the callback to
where the EUSER case was being dealt with.
In preparation for further iterator changes, this cleans up a few
small things in the iterator API:
* removed the git_iterator_for_repo_index_range API
* made git_iterator_free not be inlined
* minor param name and test function name tweaks
Somewhat surprisingly, this can increase the speed considerably, as we
don't bother trying to decide what to evict, and the most used entries
are quickly back into the cache.
This is an intermin solution. While this essentially disables the
--shared flag feature, previously external templates did not work
at all. This change fixes the previously corrected, and since
then failing, repo_init__extended_with_template() test.
The problem is now documented in the source code comments.
The indexer needs to call the packfile's free function so it takes care of
freeing the caches.
We still need to close the mwf descriptor manually so we can rename the
packfile into its final name on Windows.