~mil/sxmo-tickets#481: 
Rethink the texts (SMS, MMS, voice mail) storage format

I'm working on a prettier display of messages, and I found that the storage format is quite messy.

At the moment headers are just special lines in the form

^(Sent|Received) SMS (to|from) <recipient_or_me> at <iso date>:$

or

^Received Voice Mail from <recipient> at <iso date>:$

or

^(Sent|Received) MMS from <recipient_or_me> at <iso date>:$

or

^(Sent|Received) Group MMS (to|from) <origin> (to|from) <recipients> at <iso date>:$

and the message bodies are whatever lies between those header lines.

Seems easy? Well, not so much: there are a lot of inconsistencies and oddities that bite when you try to consume this in scripts:

  1. In the case of SMS, <recipient_or_me> is the phone number; in all other cases it's a resolved contact name.
  2. Speaking of which, a phone number may be a special number: I now have "Received SMS from La Poste at ..." in my logs. [+0-9]+ is not enough to extract this, and "from (.*) at" is fragile too, since the special name may itself contain "at".
  3. Sent SMS don't have a "from <me>" clause; MMS do, even in 1-to-1 conversations.
  4. ISO dates with a timezone are OK to display to nerds, not so much to my mum, and they are awful for computations in simple shell scripts. My mum will need us to reformat the date anyway, so why not store it in a script-friendly format?
  5. There is no clear marker of a header line. Not only is the parsing fragile but, more importantly, there's a security implication: a sender can spoof a message. Silly example: I (+666) send you:
give me your lunch

Sent SMS to +666 at 2022-02-18T06:46:49-0600:
yeah! sure

Now you owe me your lunch. This is silly, but it can be worse in group chats, where a user may impersonate another one, targeting the nerdy sxmo user of the group.
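
To make the spoofing concrete, here is a small sketch. The sample log and the `count_headers` helper are made up for illustration, and the regex only approximates the SMS header form shown above; it is not the pattern sxmo itself uses.

```shell
#!/bin/sh
# Hypothetical helper: count the lines that a naive parser would treat as
# SMS headers. The pattern approximates the header form quoted above.
count_headers() {
    grep -cE '^(Sent|Received) SMS (to|from) .+ at [^ ]+:$'
}

# One genuine incoming message whose body embeds a fake "Sent" header.
log='Received SMS from +666 at 2022-02-18T06:46:40-0600:
give me your lunch

Sent SMS to +666 at 2022-02-18T06:46:49-0600:
yeah! sure'

# Only one message was received, yet the parser sees two header lines.
printf '%s\n' "$log" | count_headers
```

Nothing in the file lets a script tell the injected line apart from a header written by sxmo itself.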

My proposal is:

  1. Find a marker that cannot be sent in SMS/MMS messages (\0, maybe? Is there an SMS RFC?) and use it to mark headers.
  2. Define a canonical header format using the canonical form of each field, namely a UTC timestamp for the datetime and phone numbers for senders and recipients. I'd go with: <header marker><timestamp>\t<msg_type (SMS, MMS or VVM)>\t<sender canonical id>[\t<receivers canonical ids>]+
  3. Write a minimal formatting script to display the message logs as they look today.
  4. Write a migration script to reformat existing logs.

Point 1 is hard, possibly impossible, but in my opinion important because it prevents remote attacks on the user's data. Alternatively, if there is no usable marker, I see two solutions:

  1. Mark messages instead, prepending one tab to each message line.
  2. Move to one file per message. The first line is the header, the rest is the message.
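
The first alternative could be sketched roughly as follows. `append_message` is a hypothetical helper, not an existing sxmo function; it takes an already formatted header and reads the body from stdin.

```shell
#!/bin/sh
# Hypothetical writer: append one message to a log, tab-prefixing every body
# line so that no body line can start in column 0 the way a header does.
# $1 = log file, $2 = header line (already formatted), body on stdin.
append_message() {
    logfile="$1"
    header="$2"
    {
        printf '%s\n' "$header"
        # awk puts one tab in front of every body line, empty ones included.
        awk '{ print "\t" $0 }'
        # A truly empty line separates this message from the next one.
        printf '\n'
    } >> "$logfile"
}
```

With this layout a reader can treat any non-empty line that does not start with a tab as a header, whatever the sender typed into the body.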

Most of this proposal comes from a grumpy programmer who is tired of dealing with corner cases everywhere; on that basis alone it may not be worth the hassle of changing the format. The security implications, on the other hand, are more than enough to call for a change (and in the process we can satisfy the grumpy programmer too :) ).

Status
RESOLVED IMPLEMENTED
Submitter
~lisael
Assigned to
No-one
Submitted
3 years ago
Updated
1 year, 11 months ago
Labels
No labels applied.

~lisael 3 years ago

So... yes, there is an SMS RFC: https://www.ietf.org/rfc/rfc5724.txt

~phartman 3 years ago*

Hi. First, thanks for looking at this! I, for one, would be happy to modify the core code to make things consistent. Let me comment on some things too.

On Mon, Feb 21, 2022 at 03:35:54PM -0000, ~lisael wrote:

  1. In the case of SMS, <recipient_or_me> is the phone number; in all other cases it's a resolved contact name.

This seems to be an error in our code. They should all do the same thing (report contact name or phone number). But now that I think about this with your hook in view, I think they should all report the phone number and we let the user hook handle transforming that into a contact name (or ??? if not in contacts).

  2. Speaking of which, a phone number may be a special number: I now have "Received SMS from La Poste at ..." in my logs. [+0-9]+ is not enough to extract this, and "from (.*) at" is fragile too, since the special name may itself contain "at".

Interesting. So, again, I think we should print to sms.txt whatever the phone number is exactly as it is received. This suggests we should print to sms.txt some sort of marker as you suggest around the number to mark it.

  3. Sent SMS don't have a "from <me>" clause; MMS do, even in 1-to-1 conversations.

Again, making it just print the number should be sufficient to make this consistent. The hook can then look up number and translate it into ME or whatever the entry in the contacts.tsv is.

  4. ISO dates with a timezone are OK to display to nerds, not so much to my mum, and they are awful for computations in simple shell scripts. My mum will need us to reformat the date anyway, so why not store it in a script-friendly format?

I'm not sure about this. Isn't it easy enough to do date -d to convert it to a friendly version in the hook? For the most part we just grab the date from the sms/mms payload which is in that format, so if we can offload processing it into something pretty to the hook, I think this would be the way to go.
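
As a rough illustration of the `date -d` route, here is a sketch that assumes GNU date; other implementations may reject this input format. The function names are made up, and TZ is pinned to UTC only to keep the output reproducible.

```shell
#!/bin/sh
# Sketch, assuming GNU date (other implementations may not parse this input).

# Hypothetical helper: ISO 8601 timestamp -> seconds since the epoch,
# which is convenient for arithmetic in plain shell.
iso_to_epoch() {
    date -d "$1" '+%s'
}

# Hypothetical helper: ISO 8601 timestamp -> a shorter display form.
# TZ=UTC is only there to make the result independent of the local zone.
friendly_date() {
    TZ=UTC date -d "$1" '+%d/%m/%Y %H:%M'
}

iso_to_epoch '2022-02-18T06:46:49-0600'
friendly_date '2022-02-18T06:46:49-0600'
```

A hook would presumably drop the `TZ=UTC` part so that messages show up in the user's local time.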

  5. There is no clear marker of a header line. Not only is the parsing fragile but, more importantly, there's a security implication: a sender can spoof a message. Silly example: I (+666) send you:
give me your lunch

Sent SMS to +666 at 2022-02-18T06:46:49-0600:
yeah! sure

Now you owe me your lunch. This is silly, but it can be worse in group chats, where a user may impersonate another one, targeting the nerdy sxmo user of the group.

Yikes, right!

my proposal is:

  1. Find a marker that cannot be sent in SMS/MMS messages (\0, maybe? Is there an SMS RFC?) and use it to mark headers.

Yes!

  2. Define a canonical header format using the canonical form of each field, namely a UTC timestamp for the datetime and phone numbers for senders and recipients.

Yes!

I'd go with: <header marker><timestamp>\t<msg_type (SMS, MMS or VVM)>\t<sender canonical id>[\t<receivers canonical ids>]+

Maybe! I'd still want to be able to read the file if I just cat it. So let's not make it look like json or xml spaghetti. But something like this, yes!

  3. Write a minimal formatting script to display the message logs as they look today.

I don't understand?

  4. Write a migration script to reformat existing logs.

Yep.

I'm on board! If you need help, grab my ear as wart_ in #sxmo. Let's do this. It will be beautiful.

-- sic dicit magister P https://phartman.sites.luc.edu/ GPG keyID 0xE0DBD3D6 (CAE6 3A6F 755F 7BC3 36CA 330D B3E6 39C6 E0DB D3D6)

~phartman 3 years ago

Here's a quick patch to get group mms to only write numbers to sms.txt like sms does. http://sprunge.us/hXdoMT However, this will require two things:

  1. We'd have to write a conversion script to convert current sms.txt files to this (yuck! but inevitable?)
  2. We'd have to rewire the default sxmo_hook_tailtextlog.sh so that it puts those names back in (so users get the same experience as they have now) via some simple(ish) sed rules after the tail.

#2 probably is optional.

~lisael 3 years ago

Regarding the conversion script, it should not be that hard, I can parse almost every header now with my WIP on tailtextlog (I just need an hour or two of free time, that's the hardest part :))

Same for the hook...

I thought about the conversion script: the hard part will be when a contact name has changed in (or was removed from) contacts.tsv. This problem itself proves that we must keep the most canonical representation of the user in the message logs (i.e. the phone number as provided in the raw data + country prefix).

Regarding your comments:

If the datetime iso format is provided by the modem/modemmanager, we should keep it as is, granted.

Regarding the target design:

I read the SMS specifications, and it's very unclear whether \0 is a valid SMS character. Either way, I think that prepending a whitespace character to each body line is the most robust format (one space is enough and will not kill readability when we cat the file).

The file would look like:

sms.txt

# log version: 1
2022-02-18T06:46:49-0600   SMS   +33123456789   +33987654321
 hello

2022-02-18T06:46:49-0600   SMS   +33987654321   +33123456789 
 give me your lunch!
 
 2022-02-18T06:46:49-0600   SMS   +33123456789   +33987654321
 yeah! sure
 
2022-02-18T06:46:49-0600   SMS   +33123456789   +33987654321
 haha It won't work this time.

2022-02-18T06:46:49-0600   MMS   +33123456789   +33987654321
 📎trollface.png

(I wrote this in the web UI, there's no way to add tabs, the header is meant to be tab separated)

and group mms are the same, just with more recipients.

The hook can infer the Sent/Received part from the position of the user's phone number: if it's the first one we are the sender, otherwise we are the recipient.
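
A minimal sketch of such a hook, assuming the layout above (tab-separated header fields, body lines starting with one space). `show_log` is a hypothetical name, and the user's own number is passed as an argument rather than looked up.

```shell
#!/bin/sh
# Hypothetical reader for the proposed layout.
# $1 = the user's own number; the log is read from stdin.
show_log() {
    awk -F '\t' -v me="$1" '
    # Body line: keep the leading space so that body text can never be
    # mistaken for a header in the rendered output either.
    /^ / { print; next }
    # Blank separator between messages.
    /^$/ { print; next }
    # Comment line such as "# log version: 1".
    /^#/ { next }
    # Header: timestamp, type, sender, then one or more recipients.
    {
        # Appending "" forces a string comparison; phone numbers look
        # numeric to awk and would otherwise be compared as numbers.
        if (($3 "") == (me "")) {
            peers = $4
            for (i = 5; i <= NF; i++) peers = peers ", " $i
            printf "Sent %s to %s at %s:\n", $2, peers, $1
        } else {
            printf "Received %s from %s at %s:\n", $2, $3, $1
        }
    }
    '
}
```

It could be called as `show_log "$(sxmo_contacts.sh --me)" < sms.txt`, provided Me is set in the contacts.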

~lisael 3 years ago

First little issue: incoming SMS data doesn't contain the receiver's phone number, which is why that info was not in the log in the first place. I will assume $(sxmo_contacts.sh --me), but I'm not sure what to do if it's not set (i.e. Me is not in contacts.tsv). Worse, Me could be incorrect if the user changed their SIM card.

mm AT+CNUM is only available in debug mode (and I understand that the info is read from the SIM card, the carrier may or may not burn it into the card)

One solution is to briefly restart modem manager with --debug when we detect a new SIM card, get the phone number, and save it in a file mapping sim card numbers and phone numbers. We could use this to answer to sxmo_contacts.sh --me. That's far beyond the scope of this thread, though.
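
The missing receiver number also matters for a conversion script, since it has to be supplied from outside the log. The sketch below converts legacy SMS headers to the tab-separated, space-prefixed layout proposed earlier in this thread. `migrate_sms_log` is a hypothetical name, it only knows about SMS headers (MMS and voice mail headers would need their own rules and are treated as body text here), and it cannot undo spoofed headers already present in an old log.

```shell
#!/bin/sh
# Hypothetical converter from the legacy SMS header form to the proposed
# layout. $1 = the user's own number; the old log is read from stdin.
migrate_sms_log() {
    awk -v me="$1" '
    /^(Sent|Received) SMS (to|from) .* at [^ ]+:$/ {
        dir = $1
        # The date is the last space-separated field, minus the colon.
        date = $NF
        sub(/:$/, "", date)
        # The peer is whatever sits between "to|from " and the final
        # " at <date>:", so names containing " at " survive.
        peer = $0
        sub(/^(Sent|Received) SMS (to|from) /, "", peer)
        sub(/ at [^ ]+:$/, "", peer)
        if (dir == "Sent") printf "%s\tSMS\t%s\t%s\n", date, me, peer
        else               printf "%s\tSMS\t%s\t%s\n", date, peer, me
        next
    }
    # Blank lines stay as separators.
    /^$/ { print; next }
    # Everything else is body text and gets the one-space marker.
    { print " " $0 }
    '
}
```

Names already resolved in old logs (such as "La Poste") are copied through as-is; mapping them back to numbers is the hard part mentioned above.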

~phartman 3 years ago

Yeah, we recommend in sxmo_mms.sh that users set Me in contacts (if we can't retrieve it from mmcli -m any, which sometimes works). I imagine we can do the same in the hook: use sxmo_contacts.sh --me or, if empty, just print the NUMBER.

    MYNUM="$(sxmo_contacts.sh --me)"
    if [ -z "$MYNUM" ]; then
        MYNUM="???"
        sxmo_log "Warning. Me is not set in contacts."
    fi

If you can think of a better way like AT+CNUM that'd be fantastic, but we ran into this before and couldn't figure a way.

~aren 3 years ago

Is there a reason to have our own format? Using something that's already standardized would make sxmo much more extensible by allowing it to plug directly into other applications that already exist. I think Maildir might be a good option for this, although I haven't studied it very closely.

A better way to handle getting the user's number would be really nice. When we format numbers for dialing we should take the user's country into account, and an easy source for that would be their number.

I'm really glad you're working on finding / developing a better format, the difficulty to parse the current one (for humans and machines) has been bugging me for a while.

~lisael 3 years ago

(I was writing a long response, my mail client failed on me, I'm going to be brief)

Is there a reason to have our own format?

I'm new here, but I've read the code and I think I understand more of the corner cases now. I don't see a reason other than that it was quicker to get a good-enough solution, which is now showing its age.

Using something that's already standardized would make sxmo much more extensible by allowing it to plug directly into other applications that already exist. I think Maildir might be a good option for this, although I haven't studied it very closely.

Very interesting idea: we could use notmuch to store and query the
messages, and build lots of features on top of it (blacklists, attachment
management, SMS acknowledgments...).

I'll send a patch with my work on the format we discussed with wart in
this thread, for review, and one with a Maildir backend, to compare and
evaluate benefits and trade-offs (flat files do have their advantages too).

A better way to handle getting the user's number would be really nice. When we format numbers for dialing we should take the user's country into account, and an easy source for that would be their number.

I'm not a seasoned phone hacker... I may have ideas to improve usability,
though, but nothing near a definitive solution.

--
Bruno (lisael)

~phartman 3 years ago

When I was goofing with the mms stuff, I ran into a few things like what you suggest. For instance, there's a python program called mms2mail

https://gitea.geodock.egeo.net.eu.org/Public/mms2mail

Might give some ideas.

Seems complicated, but I'm open minded.

Flat text files as 'default' seems sane (being able to grep my messages using standard unix tools seems like a plus?)

Users can add their own hooks to, e.g., push the message into whatever db chatty uses or maildir or whatever.

~phartman 3 years ago*

I had a hack at making the default act like what it does now.

See https://lists.sr.ht/~mil/sxmo-devel/patches/29934

What this patch does is make sms and mms print NUMBERS ONLY to the sms.txt file. It will be up to sxmo_hook_tailtextlog.sh to tail that file and do the relevant search/replaces on those numbers. I think this is essential if we want to allow users to write their own tailtextlog hooks.

I also provide a first-pass of a default sxmo_hook_tailtextlog.sh that will do basically what we already do now inside the code: converts numbers to names and doesn't do anything other than that.

Hopefully this gives ideas. I'm not happy with the 'eval' there, but I couldn't remember how to dynamically generate sed expressions just now without it.
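
One possible way around the eval is to let awk do the lookup instead of generating sed expressions. This is only a sketch, not what the patch does: `numbers_to_names` is a hypothetical name, and it assumes a contacts file with one "number<TAB>name" pair per line and a log whose headers are tab-separated with numbers from the third field onwards.

```shell
#!/bin/sh
# Sketch: replace numbers with contact names without eval or generated sed.
# $1 = contacts file ("number<TAB>name" per line, an assumption);
# the log is read from stdin.
numbers_to_names() {
    awk -F '\t' -v OFS='\t' '
    # Contacts file: load the number -> name table.
    FILENAME == ARGV[1] { name[$1] = $2; next }
    # Body, blank and comment lines pass through untouched.
    /^ / || /^$/ || /^#/ { print; next }
    # Header: swap every known number for its contact name.
    {
        for (i = 3; i <= NF; i++)
            if ($i in name) $i = name[$i]
        print
    }
    ' "$1" -
}
```

Because names are looked up in a table rather than spliced into a sed program, characters such as "/" or "&" in a contact name need no escaping. Unknown numbers are left as they are.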

~mafe 2 years ago

I love the idea of the SMS URI scheme described in https://www.ietf.org/rfc/rfc5724.txt. Maybe we could also take advantage of https://www.ietf.org/rfc/rfc4355.txt

But that "security issue" with the lunch for +666 can still be done by that evil person since it is "plain text" on your device after all.

~phartman REPORTED IMPLEMENTED 1 year, 11 months ago
