commits
In issue #57, it was reported that the library could not apply some
patches generated by BSD `diff` because of the way that variant reports
changes in files without trailing newlines. Since the library behavior
matches Git, I don't consider this a bug, but update the README to
mention that the support for standard unified diffs is only for the GNU
variant of the `diff` tool.
This worked, but completes the test coverage in combination with the
test that adds a final newline and one that leaves the missing newline
in place.
Include file names in the header (now that we can actually parse them)
and fix a bad find-and-replace that changed "differ" to "fmer". Add a
new test to verify that binary files without data format correctly.
Some patches may include one or more file paths as part of the binary
header when there is no binary data. Git accounts for this by only
checking the prefix and suffix of the line, but I missed that logic when
implementing this originally.
This enables clients to move back and forth between parsed objects and
text patches. The generated patches are semantically equal to the parsed
object and should re-parse to the same object, but may not be
byte-for-byte identical to the original input.
In my testing, formatted text patches are usually identical to the
input, but there may be cases where this is not true. Binary patches
always differ. This is because Go's 'compress/flate' package ends
streams with an empty block instead of adding the end-of-stream flag to
the last non-empty block, like Git's C implementation. Since the streams
will always be different for this reason, I chose to also enable default
compression (the test patches I generated with Git used no compression.)
The main tests for this feature involve parsing, formatting, and then
re-parsing a patch to make sure we get equal objects.
Formatting is handled by a new internal formatter type, which allows
writing all data to the same stream. This isn't exposed publicly right
now, but will be useful if there's a need for more flexible formatting
functions in the future, like formatting to a user-provided io.Writer.
The minimum Go version for the package is now Go 1.21. This is because a
future change will use the 'slices' package in test code. Note that non-test
code in the package should still be compatible with older versions of Go.
As part of this, also update the golangci-lint version to one that works
with Go 1.21, which required replacing some deprecated linters.
While empty patches with only a header were parsable, the parser
discarded the preamble content. This meant callers had to handle this
case specially. Now, if we reach the end of the input without finding a
file, Parse() returns the full content of the patch as the preamble.
When GitHub creates patches for Dependabot PRs, it generates a "From:"
line that is not valid according to RFC 5322: the address spec contains
unquoted special characters (the "[bot]" in "dependabot[bot]"). While
the 'net/mail' parser makes some exceptions to the spec, this is not one
of them, so parsing these patch headers fails.
Git's 'mailinfo' command avoids this by only implementing the unquoting
part of RFC 5322 and then applying a heuristic to separate the string in
to name and email values that seem reasonable.
This commit does two things:
1. Reimplements ParsePatchIdentity to follow Git's logic, so that it can
accept a wider range of inputs, including quoted strings. Strings
accepted by the previous implementation parse in the same way with
one exception: inputs that contain whitespace inside the angle
brackets for an email address now use the email address as the name
and drop any separate name component.
2. When parsing mail-formatted patches, use ParsePatchIdentity to parse
the "From:" line instead of the 'net/mail' function.
Git is actually more lenient here than I thought. As long as the
identity contains the "<>" delimiters, Git will allow an empty email, so
we should accept the same thing. I also discovered that an identity with
only an email set will use the email as the name, so I've implemented
that behavior as well.
If a patch is passed through a system that converts line endings to
'\r\n', mode lines end up with trailing whitespace that confuses
strconv.ParseInt. In Git, this is avoided by using strtoul() for parsing,
which stops at the first non-digit character.
Changing line endings in patch content itself will cause other problems
so it is best to avoid this transform, but if it does happen, it
shouldn't cause a parse error.
When processing mail-formatted patches, the default cleanup removed all
leading content in square brackets, but this pattern is often used to
identify tickets or other information that should remain in the commit
title. Git supports disabling this the the `-k` and `-b` flags, which we
simulate with the new SubjectCleanMode options.
Use WithSubjectCleanMode(SubjectCleanPatchOnly) to only remove bracketed
strings that contain "PATCH", keeping others that are (probably) part of
the actual commit message.
Note that because of the mail parsing library, we cannot replicate the
`-k` flag exactly and always clean leading and trailing whitespace.
Use standard string functions instead of removing leading whitespace in
a custom loop. This also makes the distinction between the two types of
mail header parsing a bit clearer.
The previous version of the golangci-lint action would install its own
version of Go, which eventually conflicted with the old pinned version
of the linter I was using. Upgrade the action to avoid this, but also
update Go and the linter while I'm here.
Remove the Applier type and replace it with TextApplier and
BinaryApplier, both of which operate on fragments instead of on full
files. Move the logic that previously existed in Applier.ApplyFile to
the top-level Apply function.
Also restructure arguments and methods to make it clear that appliers
are one-time-use objects. The destination is now set when creating an
applier and the Reset() method was replaced by Close().
Co-authored-by: Noah Goldstein <goldstein.w.n@gmail.com>
This fixes two issues:
1. Fragments with only context lines were accepted
2. Fragments with only a "no newline" marker caused a panic
The panic was found by go-fuzz. While fixing that issue, I
discovered the first problem with other types of empty fragments while
comparing behavior against 'git apply'.
Also extract some helper functions and modify the loop conditions to
clean things up while making changes in this area.
If a file was an exact multiple of 1024 bytes (the size of an internal
buffer) and was missing a final newline, the LineReaderAt implementation
would drop the last line, leading to an unexpected EOF error on apply.
In addition to fixing the bug, slightly change the behavior of
ReadLineAt to reflect how it is actually used:
1. Clarify that the return value n includes all lines instead of only
lines with a final newline. This was already true except in the
case of the bug fixed by this commit.
2. Only return io.EOF if fewer lines are read than requested. The
previous implementation also returned io.EOF if the last line was
missing a final newline, but this was confusing and didn't really
serve a purpose.
This is technically a breaking change for external implementations but
an implementation that exactly followed the "spec" was already broken in
certain edge cases.
Git does not quote file names containing spaces when generating header
lines. However, I misread the Git implementation and thought it used
spaces as a name terminator by default, implying that names with spaces
would be quoted.
Rewrite the the name parsing code to better match Git. Specifically,
when both names in a header are not quoted, test all splits to see if
any of them produce two equal names. Also account for spaces in file
names when parsing file metadata.
Finally, allow trailing spaces on file names, which seem to be allowed
by Git (although this sounds like a terrible idea to me.)
`git am` will accept patches which aren't fully RFC-compliant, as long
as they start with `From` and have a `Subject` line. Relax
ParsePatchHeader() so that it will accept such patches as well.
Specifically, it will tolerate patches of the form:
From: <foo@company.com>
In this case, the `Author` field will contain `foo@company.com
<foo@company.com>`. Duplicate this behavior.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
...when parsing emails, similar to `git am`.
Add a new field, `BodyAppendix` to PatchHeader.
Modify `scanMessageBody` to accept a boolean argument saying whether
to separate out the appendix or not. Do this by keeping two string
builders, and having it switch to the appendix builder when it finds a
`---` line.
Handling the newlines at the end as expected requires moving things
around a bit.
First, we were trimming space from the line once to decide whether the
line was empty, and then trimming space again if we determined it
wasn't empty. This only needs to be done once.
Then, do all the trimming (both of whitespace and the prefix) first,
before deciding what to do about the line.
Request BodyAppendix separately when parsing a mail, but not a commit
message.
Add some tests to verify that it works as expected.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
Primarily to get rid of [PATCH] at the front, but while we're here
just be generally compatible with `git am`:
* Remove `re` and variations
* Remove whitespace
* Remove anything in brackets
But only at the very beginning of the subject.
Store anything removed in this way in PatchHeader.SubjectPrefix.
Inspired by
https://github.com/git/git/blob/master/mailinfo.c:cleanup_subject()
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
Removed the distinction between parsed and raw dates and simply return
an error for unsupported date formats. Arbitrary date support was
probably a premature optimization and would be better supported by a
method to register custom date parsing functions.
Simple time.Time fields make the structure easier to use and mean that
a zero value always indicates the patch did not include a date.
A closer look at the Git source shows that the rfc2822 and default
formats do not zero-pad days, so we need to accept single-digit days
when parsing. The single-digit pattern will also accept the padded
format, so there's no loss of functionality.
Change the parsing test to use single-digit values to emphasize when
padding is required and when it is not.
When applying a text fragment that deletes all content in the file, make
sure there is no content left in the source after using all the lines in
the fragment. If there is is extra content, it's a conflict: something
added more lines to the file after generating the patch.
Git usually refers to the whole thing (title + body) as the "commit
message", so using different terms for the parts clarifies things. Also
add a new Message() function that returns the combined string.
This function parses patch headers (the preamble returned by the
existing Parse function) to extract information about the commit that
generated the patch. This is useful when patches are an interchange
format and this library is applying commits generated elsewhere.
Because of the variety of header formats, parsing is fairly lenient and
best-effort, although certain invalid input does cause errors.
This also extracts some test utilities from the apply tests for reuse.
Use golangci-lint to coordinate linting instead of installing tools
individually. The workflow installs the binary directly instead of using
one of the existing public actions.
copyFrom and copyLinesFrom did not increment the start offset after
reading, so the same lines were copied over and over again.
Also refactor the apply tests to use a common type now that there are
three variants that do almost the same thing for each test.
Applying a file now rejects future text fragment applies under the
assumption that all relevant fragments were part of the file.
Try to avoid confusion about whether variables refer to byte or line
counts/positions. It still isn't perfect, but I'm not sure how to
further clarify without being overly verbose.
This adds minimal new coverage, since the apply tests already exercise
this, but it's complicated enough that dedicated tests will be helpful.
This removes the distinction between "strict" and "fuzzy" application
by allowing future methods on Applier that control settings. It also
avoids state tracking in the text fragment apply signature by moving it
into the Applier type.
While in practice, an Applier will be used once and discarded, the
capability is provided to reset it for multiple uses.
This is no longer used for text application, so revert the parser back
to using StringReader.
This is functionally equivalent to the previous version (except for one
error case), but uses the new interface. I think the code is simpler
overall because it removes the line tracking.
This is a line-oriented parallel to io.ReaderAt, meant for text applies.
While the mapping isn't quite as clean as in the binary case, a text
apply still reads a fixed chunk of lines starting at a specific line
number and modifies them. This also allows a consistent interface for
strict and fuzzy applies.
The implementation wraps an io.ReaderAt and reads data in chunks,
indexing line boundaries as it goes. This is probably not the most
efficient way to implement this interface, but it works and allows file
application to take a consistent interface.
This better matches the actual contract of the delta patch, which reads
fixed size chunks of the source files at arbitrary positions.
These cover some of the possible errors with delta fragments. Many of
the same tests could be implemented slightly easier by testing the
helper functions, but the infrastructure is already set up to test the
full function.
This is made easier by the bin.go CLI, which can parse and encode binary
patch data. Once parsed, fragments were manipulated with truncate and
dd, encoded, and placed in the patch files.
This ensures a default size is used for large copy instructions.
Covers basic literal and delta application. Input files were created by
using dd and the 'conv=notrunc' option to modify sections of files
created from /dev/urandom. All file start with two null bytes to
convince Git they are binary.
Currently untested, but based on code I validated against some generated
patches. Tests comming in a future commit.
In issue #57, it was reported that the library could not apply some
patches generated by BSD `diff` because of the way that variant reports
changes in files without trailing newlines. Since the library behavior
matches Git, I don't consider this a bug, but update the README to
mention that the support for standard unified diffs is only for the GNU
variant of the `diff` tool.
This enables clients to move back and forth between parsed objects and
text patches. The generated patches are semantically equal to the parsed
object and should re-parse to the same object, but may not be
byte-for-byte identical to the original input.
In my testing, formatted text patches are usually identical to the
input, but there may be cases where this is not true. Binary patches
always differ. This is because Go's 'compress/flate' package ends
streams with an empty block instead of adding the end-of-stream flag to
the last non-empty block, like Git's C implementation. Since the streams
will always be different for this reason, I chose to also enable default
compression (the test patches I generated with Git used no compression.)
The main tests for this feature involve parsing, formatting, and then
re-parsing a patch to make sure we get equal objects.
Formatting is handled by a new internal formatter type, which allows
writing all data to the same stream. This isn't exposed publicly right
now, but will be useful if there's a need for more flexible formatting
functions in the future, like formatting to a user-provided io.Writer.
The minimum Go version for the package is now Go 1.21. This is because a
future change will use the 'slices' package in test code. Note that non-test
code in the package should still be compatible with older versions of Go.
As part of this, also update the golangci-lint version to one that works
with Go 1.21, which required replacing some deprecated linters.
When GitHub creates patches for Dependabot PRs, it generates a "From:"
line that is not valid according to RFC 5322: the address spec contains
unquoted special characters (the "[bot]" in "dependabot[bot]"). While
the 'net/mail' parser makes some exceptions to the spec, this is not one
of them, so parsing these patch headers fails.
Git's 'mailinfo' command avoids this by only implementing the unquoting
part of RFC 5322 and then applying a heuristic to separate the string in
to name and email values that seem reasonable.
This commit does two things:
1. Reimplements ParsePatchIdentity to follow Git's logic, so that it can
accept a wider range of inputs, including quoted strings. Strings
accepted by the previous implementation parse in the same way with
one exception: inputs that contain whitespace inside the angle
brackets for an email address now use the email address as the name
and drop any separate name component.
2. When parsing mail-formatted patches, use ParsePatchIdentity to parse
the "From:" line instead of the 'net/mail' function.
Git is actually more lenient here than I thought. As long as the
identity contains the "<>" delimiters, Git will allow an empty email, so
we should accept the same thing. I also discovered that an identity with
only an email set will use the email as the name, so I've implemented
that behavior as well.
If a patch is passed through a system that converts line endings to
'\r\n', mode lines end up with trailing whitespace that confuses
strconv.ParseInt. In Git, this is avoided by using strtoul() for parsing,
which stops at the first non-digit character.
Changing line endings in patch content itself will cause other problems
so it is best to avoid this transform, but if it does happen, it
shouldn't cause a parse error.
When processing mail-formatted patches, the default cleanup removed all
leading content in square brackets, but this pattern is often used to
identify tickets or other information that should remain in the commit
title. Git supports disabling this the the `-k` and `-b` flags, which we
simulate with the new SubjectCleanMode options.
Use WithSubjectCleanMode(SubjectCleanPatchOnly) to only remove bracketed
strings that contain "PATCH", keeping others that are (probably) part of
the actual commit message.
Note that because of the mail parsing library, we cannot replicate the
`-k` flag exactly and always clean leading and trailing whitespace.
Remove the Applier type and replace it with TextApplier and
BinaryApplier, both of which operate on fragments instead of on full
files. Move the logic that previously existed in Applier.ApplyFile to
the top-level Apply function.
Also restructure arguments and methods to make it clear that appliers
are one-time-use objects. The destination is now set when creating an
applier and the Reset() method was replaced by Close().
This fixes two issues:
1. Fragments with only context lines were accepted
2. Fragments with only a "no newline" marker caused a panic
The panic was found by go-fuzz. While fixing that issue, I
discovered the first problem with other types of empty fragments while
comparing behavior against 'git apply'.
Also extract some helper functions and modify the loop conditions to
clean things up while making changes in this area.
If a file was an exact multiple of 1024 bytes (the size of an internal
buffer) and was missing a final newline, the LineReaderAt implementation
would drop the last line, leading to an unexpected EOF error on apply.
In addition to fixing the bug, slightly change the behavior of
ReadLineAt to reflect how it is actually used:
1. Clarify that the return value n includes all lines instead of only
lines with a final newline. This was already true except in the
case of the bug fixed by this commit.
2. Only return io.EOF if fewer lines are read than requested. The
previous implementation also returned io.EOF if the last line was
missing a final newline, but this was confusing and didn't really
serve a purpose.
This is technically a breaking change for external implementations but
an implementation that exactly followed the "spec" was already broken in
certain edge cases.
Git does not quote file names containing spaces when generating header
lines. However, I misread the Git implementation and thought it used
spaces as a name terminator by default, implying that names with spaces
would be quoted.
Rewrite the the name parsing code to better match Git. Specifically,
when both names in a header are not quoted, test all splits to see if
any of them produce two equal names. Also account for spaces in file
names when parsing file metadata.
Finally, allow trailing spaces on file names, which seem to be allowed
by Git (although this sounds like a terrible idea to me.)
`git am` will accept patches which aren't fully RFC-compliant, as long
as they start with `From` and have a `Subject` line. Relax
ParsePatchHeader() so that it will accept such patches as well.
Specifically, it will tolerate patches of the form:
From: <foo@company.com>
In this case, the `Author` field will contain `foo@company.com
<foo@company.com>`. Duplicate this behavior.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
...when parsing emails, similar to `git am`.
Add a new field, `BodyAppendix` to PatchHeader.
Modify `scanMessageBody` to accept a boolean argument saying whether
to separate out the appendix or not. Do this by keeping two string
builders, and having it switch to the appendix builder when it finds a
`---` line.
Handling the newlines at the end as expected requires moving things
around a bit.
First, we were trimming space from the line once to decide whether the
line was empty, and then trimming space again if we determined it
wasn't empty. This only needs to be done once.
Then, do all the trimming (both of whitespace and the prefix) first,
before deciding what to do about the line.
Request BodyAppendix separately when parsing a mail, but not a commit
message.
Add some tests to verify that it works as expected.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
Primarily to get rid of [PATCH] at the front, but while we're here
just be generally compatible with `git am`:
* Remove `re` and variations
* Remove whitespace
* Remove anything in brackets
But only at the very beginning of the subject.
Store anything removed in this way in PatchHeader.SubjectPrefix.
Inspired by
https://github.com/git/git/blob/master/mailinfo.c:cleanup_subject()
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Co-authored-by: George Dunlap <george.dunlap@citrix.com>
Removed the distinction between parsed and raw dates and simply return
an error for unsupported date formats. Arbitrary date support was
probably a premature optimization and would be better supported by a
method to register custom date parsing functions.
Simple time.Time fields make the structure easier to use and mean that
a zero value always indicates the patch did not include a date.
A closer look at the Git source shows that the rfc2822 and default
formats do not zero-pad days, so we need to accept single-digit days
when parsing. The single-digit pattern will also accept the padded
format, so there's no loss of functionality.
Change the parsing test to use single-digit values to emphasize when
padding is required and when it is not.
This function parses patch headers (the preamble returned by the
existing Parse function) to extract information about the commit that
generated the patch. This is useful when patches are an interchange
format and this library is applying commits generated elsewhere.
Because of the variety of header formats, parsing is fairly lenient and
best-effort, although certain invalid input does cause errors.
This also extracts some test utilities from the apply tests for reuse.
This removes the distinction between "strict" and "fuzzy" application
by allowing future methods on Applier that control settings. It also
avoids state tracking in the text fragment apply signature by moving it
into the Applier type.
While in practice, an Applier will be used once and discarded, the
capability is provided to reset it for multiple uses.
This is a line-oriented parallel to io.ReaderAt, meant for text applies.
While the mapping isn't quite as clean as in the binary case, a text
apply still reads a fixed chunk of lines starting at a specific line
number and modifies them. This also allows a consistent interface for
strict and fuzzy applies.
The implementation wraps an io.ReaderAt and reads data in chunks,
indexing line boundaries as it goes. This is probably not the most
efficient way to implement this interface, but it works and allows file
application to take a consistent interface.
These cover some of the possible errors with delta fragments. Many of
the same tests could be implemented slightly easier by testing the
helper functions, but the infrastructure is already set up to test the
full function.
This is made easier by the bin.go CLI, which can parse and encode binary
patch data. Once parsed, fragments were manipulated with truncate and
dd, encoded, and placed in the patch files.