~whynothugo/pimsync#78: 
vparser should handle invalid utf-8 input

From rfc5545#section-3.1:

Note: It is possible for very simple implementations to generate improperly folded lines in the middle of a UTF-8 multi-octet sequence. For this reason, implementations need to unfold lines in such a way to properly restore the original sequence.

If a line is folded in the middle of a UTF-8 multi-octet sequence, the folded input is not valid UTF-8 until it has been unfolded. I need to treat all input passed to vparser as a byte sequence, not as a String.
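
To illustrate the problem, here is a minimal sketch of unfolding on raw bytes. The function name `unfold` is hypothetical and is not vparser's actual API; it only assumes the RFC 5545 rule that a CRLF immediately followed by a single space or tab is a fold.

```rust
/// Unfold content lines on raw bytes: drop every CRLF that is
/// immediately followed by a single space or horizontal tab.
/// Hypothetical helper for illustration, not part of vparser.
fn unfold(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let is_fold = input[i..].starts_with(b"\r\n")
            && matches!(input.get(i + 2).copied(), Some(b' ' | b'\t'));
        if is_fold {
            // Skip CR, LF and the single whitespace octet.
            i += 3;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}
```

For example, "é" is the two octets 0xC3 0xA9. With the input `b"SUMMARY:caf\xC3\r\n \xA9\r\n"`, `std::str::from_utf8` fails on the folded bytes but succeeds on the result of `unfold`, which is why the input cannot be a String before unfolding.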

Milestone 34a

Status
RESOLVED IMPLEMENTED
Submitter
~whynothugo
Assigned to
No-one
Submitted
8 months ago
Updated
25 days ago
Labels
0:bug 0:todo 3:vparser

~whynothugo 6 months ago

When reading from a file, a multi-octet sequence could be interrupted by a continuation line. This case no doubt needs to support non-String input.

In the case of a UTF-8 CalDAV response, the content needs to be valid UTF-8, so a sequence can't be interrupted. I think that a non-UTF-8 byte could be escaped inside there, although I need to re-read the XML spec to ascertain this.

I have the impression that roxmltree might not fully handle escaped non-UTF-8 byte sequences. roxmltree::Node::text returns an Option<&str>, which is always valid UTF-8. This will require thorough testing.

~whynothugo 3 months ago*

A reproduction example is most easily achieved with a vdir storage.

Requires content-line-writer, tracked in issue 64.

There are currently three usages of vparser::Parser:

  • Calculating an item's hash: needs &[u8], so requires no changes.
  • replace_uid: requires a minor refactor (also: apparently unused right now?).
  • simple_component: requires a larger, but isolated, refactor.

All Item instances will have to be operated upon as Vec<u8> (or maybe just Bytes) instead of String. This might still have hidden risks.

~whynothugo REPORTED IMPLEMENTED 25 days ago

Implemented in the branch non-utf8: https://git.sr.ht/~whynothugo/vparser/log/non-utf8
