~rjarry/aerc#65: 
Charset handling in filters

Hello,

I'm currently switching from neomutt to aerc, and here's one of the first problems I have: neomutt vs aerc

The mailcap entry used by neomutt is text/html; bwrap_auto.bash w3m -I %{charset} -T text/html --; copiousoutput and text/html=w3m -dump -T text/html for the aerc filter.

Since the filters command are actually passed to sh, why not just set an AERC_CONTENT_CHARSET environment variable? Or always convert the filter input to UTF-8 and document it.

Status
REPORTED
Submitter
~q3cpma
Assigned to
No-one
Submitted
11 days ago
Updated
11 days ago
Labels
No labels applied.

~rjarry 11 days ago

~q3cpma, Jul 30, 2022 at 09:44:

Since the filters command are actually passed to sh, why not just set an AERC_CONTENT_CHARSET environment variable? Or always convert the filter input to UTF-8 and document it.

I have been thinking about exporting a few variables to filters. Something like this should be a starting point:

diff --git a/widgets/msgviewer.go b/widgets/msgviewer.go
index eec786c507a9..dd37ddcee999 100644
--- a/widgets/msgviewer.go
+++ b/widgets/msgviewer.go
@@ -4,6 +4,7 @@ import (
        "bufio"
        "errors"
        "fmt"
+       "internal/syscall/execenv"
        "io"
        "os"
        "os/exec"
@@ -547,9 +548,8 @@ func NewPartViewer(acct *AccountView, conf *config.AercConfig,
        pager = exec.Command(cmd[0], cmd[1:]...)
 
        info := msg.MessageInfo()
+       mime := strings.ToLower(part.MIMEType) + "/" + strings.ToLower(part.MIMESubType)
        for _, f := range conf.Filters {
-               mime := strings.ToLower(part.MIMEType) +
-                       "/" + strings.ToLower(part.MIMESubType)
                switch f.FilterType {
                case config.FILTER_MIMETYPE:
                        if fnmatch.Match(f.Filter, mime, 0) {
@@ -578,6 +578,17 @@ func NewPartViewer(acct *AccountView, conf *config.AercConfig,
                }
        }
        if filter != nil {
+               env = execenv.Default(filter.SysProcAttr)
+               env = append(env, fmt.Sprintf("AERC_ACCOUNT=%s", acct.acct.Name))
+               env = append(env, fmt.Sprintf("AERC_CONTENT_TYPE=%s", mime))
+               if charset, found := part.Params["charset"]; fount {
+                       env = append(env, fmt.Sprintf("AERC_CONTENT_CHARSET=%s", charset))
+               }
+               if part.Encoding != "" {
+                       env = append(env, fmt.Sprintf("AERC_CONTENT_ENCODING=%s", part.Encoding))
+               }
+               filter.Env = env
+
                if pipe, err = filter.StdinPipe(); err != nil {
                        return nil, err
                }

Do you want to apply this, test it and submit patch?

Hadrien Lacour 11 days ago · edit

On Sat, Jul 30, 2022 at 03:37:24PM +0000, ~rjarry wrote:

~q3cpma, Jul 30, 2022 at 09:44:

Since the filters command are actually passed to sh, why not just set an AERC_CONTENT_CHARSET environment variable? Or always convert the filter input to UTF-8 and document it.

I have been thinking about exporting a few variables to filters. Something like this should be a starting point:

diff --git a/widgets/msgviewer.go b/widgets/msgviewer.go
index eec786c507a9..dd37ddcee999 100644
--- a/widgets/msgviewer.go
+++ b/widgets/msgviewer.go
@@ -4,6 +4,7 @@ import (
        "bufio"
        "errors"
        "fmt"
+       "internal/syscall/execenv"
        "io"
        "os"
        "os/exec"
@@ -547,9 +548,8 @@ func NewPartViewer(acct *AccountView, conf *config.AercConfig,
        pager = exec.Command(cmd[0], cmd[1:]...)

        info := msg.MessageInfo()
+       mime := strings.ToLower(part.MIMEType) + "/" + strings.ToLower(part.MIMESubType)
        for _, f := range conf.Filters {
-               mime := strings.ToLower(part.MIMEType) +
-                       "/" + strings.ToLower(part.MIMESubType)
                switch f.FilterType {
                case config.FILTER_MIMETYPE:
                        if fnmatch.Match(f.Filter, mime, 0) {
@@ -578,6 +578,17 @@ func NewPartViewer(acct *AccountView, conf *config.AercConfig,
                }
        }
        if filter != nil {
+               env = execenv.Default(filter.SysProcAttr)
+               env = append(env, fmt.Sprintf("AERC_ACCOUNT=%s", acct.acct.Name))
+               env = append(env, fmt.Sprintf("AERC_CONTENT_TYPE=%s", mime))
+               if charset, found := part.Params["charset"]; fount {
+                       env = append(env, fmt.Sprintf("AERC_CONTENT_CHARSET=%s", charset))
+               }
+               if part.Encoding != "" {
+                       env = append(env, fmt.Sprintf("AERC_CONTENT_ENCODING=%s", part.Encoding))
+               }
+               filter.Env = env
+
                if pipe, err = filter.StdinPipe(); err != nil {
                        return nil, err
                }

Do you want to apply this, test it and submit patch?

-- View on the web: https://todo.sr.ht/~rjarry/aerc/65#event-191586

Sadly, I get widgets/msgviewer.go:7:2: use of internal package internal/syscall/execenv not allowed when doing a make. Completely new to Go, by the way.

Hadrien Lacour referenced this from #65 11 days ago

~rjarry 11 days ago

The patch needed some changes. I have sent it on the mailing list:

https://lists.sr.ht/~rjarry/aerc-devel/patches/34308

Feel free to test it and comment.

~q3cpma 11 days ago

On Sat, Jul 30, 2022 at 04:34:36PM +0000, ~rjarry wrote:

The patch needed some changes. I have sent it on the mailing list:

https://lists.sr.ht/~rjarry/aerc-devel/patches/34308

The environment is set properly and the charset variable has the correct value, but it looks like the filter input is mangled from the beginning:

I use this script as HTML filter to debug:

#!/bin/sh -eu

tempfile=$(mktemp)
stdin=$(tee "$tempfile")
trap 'exit 1' HUP INT QUIT ABRT ALRM TERM
trap 'rm "$tempfile"' EXIT

{
	date
	echo 'STDIN:'
	echo "$stdin"
	echo "AERC_CONTENT_TYPE:    $AERC_CONTENT_TYPE"
	echo "AERC_CONTENT_CHARSET: $AERC_CONTENT_CHARSET"
	echo "uchardet result:      $(uchardet "$tempfile")"
	echo
} >>~/filter_html.log

echo "$stdin" | w3m -dump -I "$AERC_CONTENT_CHARSET" -T text/html

which gives me

Sat Jul 30 07:48:45 PM CEST 2022
STDIN:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="htt>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
...
AERC_CONTENT_TYPE:    text/html
AERC_CONTENT_CHARSET: iso-8859-1
uchardet result:      UTF-8

Let me investigate a bit what mutt gives with this.

~q3cpma 11 days ago

Using a mailcap with

text/html; env AERC_CONTENT_CHARSET=%{charset} ~/filter_html.sh

Neomutt gives me

...
AERC_CONTENT_TYPE:    empty
AERC_CONTENT_CHARSET: iso-8859-1
uchardet result:      ISO-8859-1

with actual Latin-1 content (e.g. 0xe9 for é).

PS: I guess it's fortunate that w3m -I '' works fine

~q3cpma 11 days ago

Using text/html=w3m -dump -I UTF-8 -T text/html works fine, so maybe we can assume UTF-8 as input always? w3m was originally choking on it because the header announced latin-1, I guess.

~konimarti referenced this from #65 11 days ago

~konimarti 11 days ago

On Sat Jul 30, 2022 at 8:07 PM CEST, ~q3cpma wrote:

Using text/html=w3m -dump -I UTF-8 -T text/html works fine, so maybe we can assume UTF-8 as input always? w3m was originally choking on it because the header announced latin-1, I guess.

Yes, I think that's correct. I'm almost a 100% that aerc always provides UTF-8 data to the filters since we use the go-message 0 package internally which automatically decodes the charset to UTF-8.

-- View on the web: https://todo.sr.ht/~rjarry/aerc/65#event-191643

~q3cpma 11 days ago

It may be wrong from a correctness point of view, though, as the HTML announced charset may now be false.

~q3cpma referenced this from #70 9 days ago

~konimarti referenced this from #70 9 days ago

Register here or Log in to comment, or comment via email.