commits
LineReader is a wrapper around StringReader that counts line numbers.
This is not currently used by the parser, but will be used by the apply
functions.
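A minimal sketch of a line-counting wrapper, for illustration only. The names `StringReader`, `lineReader`, `newLineReader`, and `ReadLine` are assumptions modeled on the description above, not necessarily the library's actual types or signatures.

```go
package main

import (
	"bufio"
	"strings"
)

// StringReader is the assumed minimal interface being wrapped.
// *bufio.Reader satisfies it.
type StringReader interface {
	ReadString(delim byte) (string, error)
}

// lineReader (hypothetical) wraps a StringReader and counts lines read.
type lineReader struct {
	r StringReader
	n int
}

// newLineReader is a convenience constructor for string input.
func newLineReader(s string) *lineReader {
	return &lineReader{r: bufio.NewReader(strings.NewReader(s))}
}

// ReadLine returns the next line and its 1-based line number. When no
// data is left, it returns an empty line, the number of the last line
// read, and the underlying error.
func (lr *lineReader) ReadLine() (string, int, error) {
	line, err := lr.r.ReadString('\n')
	if len(line) > 0 {
		lr.n++
	}
	return line, lr.n, err
}
```

Keeping the counter in the reader means callers such as apply functions can report positions without tracking them separately.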
The tests were already split up in this way, so it makes sense to have
the parsing functions split as well.
These are fairly basic as the decoder is also exercised by the fragment
parsing tests, but they cover some error cases that may not be covered
otherwise.
Git uses a unique Base85 encoding with different characters than the
ascii85 encoding implemented by Go, so add a custom decoding function.
Once decoded, use zlib instead of the raw DEFLATE algorithm to
decompress the data.
These issues were caught by some basic parsing tests which are added
here as well.
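As a rough illustration of the two steps described above, the sketch below decodes five-character groups using Git's Base85 alphabet and then inflates the result with `compress/zlib`. The function names `b85Decode` and `inflate` are hypothetical, and the sketch ignores the per-line length prefix that Git binary patch lines carry.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"errors"
	"io"
)

// b85Alphabet is the character set used by Git's Base85 encoding,
// which differs from the one used by Go's encoding/ascii85.
const b85Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~"

// b85Decode (hypothetical name) turns each group of five characters
// into four bytes, most significant byte first.
func b85Decode(src string) ([]byte, error) {
	if len(src)%5 != 0 {
		return nil, errors.New("base85: input length must be a multiple of 5")
	}
	var table [256]int
	for i := range table {
		table[i] = -1
	}
	for i := 0; i < len(b85Alphabet); i++ {
		table[b85Alphabet[i]] = i
	}

	out := make([]byte, 0, len(src)/5*4)
	for i := 0; i < len(src); i += 5 {
		var v uint64
		for j := 0; j < 5; j++ {
			d := table[src[i+j]]
			if d < 0 {
				return nil, errors.New("base85: invalid character")
			}
			v = v*85 + uint64(d)
		}
		if v > 0xFFFFFFFF {
			return nil, errors.New("base85: group overflows 32 bits")
		}
		out = append(out, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
	}
	return out, nil
}

// inflate decompresses zlib-wrapped data (not raw DEFLATE).
func inflate(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Using `zlib.NewReader` rather than `flate.NewReader` matters because the zlib format adds a header and checksum around the DEFLATE stream.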
Parse the fragment header separately from the fragment chunk, which
makes each function a bit more understandable.
Parse forward and optionally reverse fragments, decoding and inflating
the ascii85 encoded data in a binary patch. This is completely untested
at the moment and probably has obvious and stupid bugs.
The binary marker is the text that appears where a text fragment
normally would and indicates that the file is binary. It's not quite a
header, because content is optional in a binary patch. If the patch does
include binary fragments, they have their own format, with a header.
Fragment is now TextFragment to distinguish from a future
BinaryFragment. Also rename FragmentLine to Line, since the
text-orientation is implied by the name.
This allows callers to provide patches with commit or email headers and
then retrieve that leading content for additional parsing without
having to identify where the first file in the patch starts.
While it's cool that this works, in a multi-file patch each file
immediately follows the final fragment of the previous file.
For simplicity, use the same fragments in both of these files and in the
single file test.
For now, this uses JSON to print objects on error, which is hard to
debug. It should probably use something like google/go-cmp instead,
because these objects are now too large for direct comparison.
Primarily check that leading non-header content is ignored and that
special errors (like detached fragment headers) are raised.
This should almost always be handled specially, so don't let tests that
expect errors hide the fact that they returned an io.EOF. This could be
revisited if the tests are ever updated to check for specific errors.
Covers that multiple fragments actually work and the unique error
conditions, but otherwise relies on the lower level tests for
correctness. Also update some comments and the README based on
observations from writing/debugging these tests.
These are about to get even longer, so it makes sense to separate them.
Also refactor the remaining parser tests to remove some duplication.
To make output better, also add some (temporary?) String() functions and
fix an assumption about the smallest fragment header I discovered was
wrong while looking at sample patches.
Correctly process "\ No newline..." marker lines when parsing fragments,
trimming the newline from the last read line and advancing the parser.
Also fix the text chunk parser to follow the invariant of not advancing
past the end of the object in the event of an error. The error in
question now has a correct line number as well.
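The marker handling can be pictured with the simplified sketch below. It works on a slice of already-read lines rather than a streaming parser, and the function name `applyNoNewlineMarkers` and the `\ ` prefix check are assumptions made for the example.

```go
package main

import "strings"

// applyNoNewlineMarkers (hypothetical) drops "\ No newline..." marker
// lines and trims the trailing newline from the line that precedes
// each marker.
func applyNoNewlineMarkers(lines []string) []string {
	out := make([]string, 0, len(lines))
	for _, l := range lines {
		if strings.HasPrefix(l, "\\ ") {
			if n := len(out); n > 0 {
				out[n-1] = strings.TrimSuffix(out[n-1], "\n")
			}
			continue // the marker itself is not fragment content
		}
		out = append(out, l)
	}
	return out
}
```

For example, given the lines `" a\n"`, `"-b\n"`, the marker, and `"+b\n"`, only the `-b` line loses its newline.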
The parser now returns an EOF on the first call to Next() if the input
is empty and increments the line number before returning EOF for the
first time.
Binary patch support is still unimplemented, but the functions are
stubbed and the overall structure seems to make sense. This also renames
the existing fragment functions to have "Text" in their names (for
clarity) and moves the parsing of fragment lines to a new "chunk"
function, which will match how binary parsing works.
Track leading and trailing context lines because it's easy and Git
includes this information in the fragment type. Add validation for when
the fragment does not agree with the header or with the new/deleted
state of the file.
This should work, but does no validation on the fragment after parsing.
After considering fragment parsing, it made sense to change the
invariant established in the previous commit. Specifically, parser
functions now assume they are called on the first line of their object
and return with the parser on the first line after their object. This
means code can call parse functions immediately after each other
without advancing the parser in between.
It's possible this will change back later on... there seem to be
annoying edge cases with either choice, but I think making the functions
consistent is important.
After parsing a valid header, the current line of the parser should be
the last line of the header. If the header is invalid (the parse
function returns an error), the state of the parser is undefined.
Also add a test since this is easy to mess up.
Instead of checking that a line has a certain prefix before calling a
header parsing function, the functions now check this and return nil
objects when called on the wrong line type. If the line passes this
basic check but is still invalid, an error is returned as before.
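The calling convention might look like the sketch below, which uses a text fragment header as the example. The type `fragmentHeader` and the functions `parseFragmentHeader` and `parseRange` are hypothetical, and the parsing is deliberately simplified.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fragmentHeader (hypothetical) holds the ranges from "@@ -a,b +c,d @@".
type fragmentHeader struct {
	OldPos, OldLines int
	NewPos, NewLines int
}

// parseFragmentHeader returns (nil, nil) when the line is not a
// fragment header at all, and (nil, err) when it looks like one but
// is malformed.
func parseFragmentHeader(line string) (*fragmentHeader, error) {
	const start, end = "@@ -", " @@"
	if !strings.HasPrefix(line, start) {
		return nil, nil // wrong line type: no object, no error
	}
	body, _, ok := strings.Cut(line[len(start):], end)
	if !ok {
		return nil, fmt.Errorf("fragment header: missing closing %q", end)
	}
	oldRange, newRange, ok := strings.Cut(body, " +")
	if !ok {
		return nil, fmt.Errorf("fragment header: missing new range")
	}
	var h fragmentHeader
	var err error
	if h.OldPos, h.OldLines, err = parseRange(oldRange); err != nil {
		return nil, fmt.Errorf("fragment header: old range: %v", err)
	}
	if h.NewPos, h.NewLines, err = parseRange(newRange); err != nil {
		return nil, fmt.Errorf("fragment header: new range: %v", err)
	}
	return &h, nil
}

// parseRange parses "pos" or "pos,count"; the count defaults to 1.
func parseRange(s string) (pos, n int, err error) {
	posStr, nStr, hasCount := strings.Cut(s, ",")
	if pos, err = strconv.Atoi(posStr); err != nil {
		return 0, 0, err
	}
	n = 1
	if hasCount {
		if n, err = strconv.Atoi(nStr); err != nil {
			return 0, 0, err
		}
	}
	return pos, n, nil
}
```

A caller can then try each parse function in turn and move on whenever both return values are nil, without checking prefixes itself.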
At the moment, we need only to read three lines, but now the value is
easy to adjust as needed. I've only seen the Git implementation read two
lines ahead so far, but it has the whole input in memory and may read
more in other places.
Call Next() to advance the parser state until it returns a non-nil
error, then check if the error is io.EOF. This makes EOF handling easier
and also means that Line() and PeekLine() can be called multiple times
without changing state.
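A small sketch of this iteration pattern follows. The `parser` type here is a stand-in with a single line of lookahead, written for illustration; its fields, `newParser`, and `readAll` are assumptions, and it does not reproduce the library's line-number behavior at EOF.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// parser (hypothetical) holds the current line and one peeked line.
type parser struct {
	r       *bufio.Reader
	cur     string
	next    string
	nextErr error
	lineno  int
	started bool
}

func newParser(input string) *parser {
	return &parser{r: bufio.NewReader(strings.NewReader(input))}
}

// read returns the next raw line, treating a final line without a
// trailing newline as a normal line.
func (p *parser) read() (string, error) {
	line, err := p.r.ReadString('\n')
	if err == io.EOF && line != "" {
		err = nil
	}
	return line, err
}

// Next advances to the next line, or returns io.EOF when none remain.
func (p *parser) Next() error {
	if !p.started {
		p.started = true
		p.next, p.nextErr = p.read()
	}
	if p.nextErr != nil {
		p.cur = ""
		return p.nextErr
	}
	p.cur = p.next
	p.lineno++
	p.next, p.nextErr = p.read()
	return nil
}

// Line and PeekLine only read state, so repeated calls are safe.
func (p *parser) Line() string     { return p.cur }
func (p *parser) PeekLine() string { return p.next }

// readAll shows the loop: call Next until it fails, then check for EOF.
func readAll(input string) (lines []string, cleanEOF bool) {
	p := newParser(input)
	err := p.Next()
	for ; err == nil; err = p.Next() {
		lines = append(lines, p.Line())
	}
	return lines, err == io.EOF
}
```

Because only `Next` mutates the parser, a parse function can inspect `Line()` and `PeekLine()` as often as it needs before deciding whether to consume anything.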
The next step is to update parse functions to return an internal marker
error if they are called on an invalid line. This should improve
correctness and remove the duplication of testing a condition and then
calling a parse function, which checks the same condition.
Also parse the header directly instead of using a regexp. This allows
finer-grained errors.
In certain cases, the error is generated from the peeked next line or a
line that was read in the past. The delta flag corrects for this,
producing accurate error messages.
Even though these are defined on *parser, it makes more sense to have
them in this file.
While I think this works, this is mostly for completeness. I (and I
expect most other users, if any) intend to use this with git patches.
Also fix a bug with EOF handling in the parsing function and make sure
the test does not ignore unexpected errors.
This avoids including the newline character as part of the name when it
is the last name on a line. Also split out functions for quoted and
unquoted names to clarify the logic.
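One way to picture the split is the sketch below. All three function names are hypothetical, `strconv.Unquote` is used as an approximation of Git's C-style unquoting, and ending an unquoted name at the first tab is an assumption made for the example.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// parseName (hypothetical) reads one file name from the rest of a
// header line, excluding the trailing newline.
func parseName(s string) (string, error) {
	s = strings.TrimSuffix(s, "\n")
	if strings.HasPrefix(s, `"`) {
		return parseQuotedName(s)
	}
	return parseUnquotedName(s), nil
}

// parseQuotedName finds the closing quote, skipping escaped
// characters, and unquotes the name.
func parseQuotedName(s string) (string, error) {
	end := -1
	for i := 1; i < len(s); i++ {
		if s[i] == '\\' {
			i++ // skip the escaped character
			continue
		}
		if s[i] == '"' {
			end = i
			break
		}
	}
	if end < 0 {
		return "", errors.New("name: missing closing quote")
	}
	// Assumption: Go's escape syntax is close enough to Git's quoting.
	return strconv.Unquote(s[:end+1])
}

// parseUnquotedName ends the name at the first tab or at end of line.
func parseUnquotedName(s string) string {
	if i := strings.IndexByte(s, '\t'); i >= 0 {
		return s[:i]
	}
	return s
}
```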
The parsed OIDs are usually prefixes, not full hashes, so rename the
fields appropriately. Also note that values that are too long to be
valid OIDs are still accepted by the library.
This gets 100% test coverage for the functions involved in parsing
Git-style file headers. Whether that actually makes it correct remains
to be seen.
Fix parseGitHeaderIndex to allow abbreviated OIDs and add a note about
this to the README. The behavior might change again later, based on some
experimentation with Git.
Also convert all table tests to use maps instead of structs with name
fields and standardize other field names.
Adds "default" name handling, error checks for missing names, and prefix
and double-slash stripping.
This is slightly more code, but I think is a bit cleaner, especially for
functions that set or could set multiple values, like the currently
unimplemented index parsing. It also removes the string manipulation
that happened as a side effect of matching a case, which was weird.
Also change how error handling works for similarity score parsing.
I believe the structure is correct, but there are a lot of details to
fill in - translating the C string parsing logic into Go is not always
straightforward.