~lioploum/offpunk#32: 
Unknown link types result in offpunk sync crashing

Unknown links cause offpunk to raise the following:

Traceback (most recent call last):
  File "/usr/sbin/offpunk", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1888, in main
    gc.call_sync(refresh_time=refresh_time,depth=depth,lists=args.url)
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1739, in call_sync
    fetch_list(l,validity=refresh_time,depth=depth,tourchildren=True)
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1713, in fetch_list
    fetch_url(l,depth=depth,validity=validity,savetotour=tourchildren,count=[counter,end])
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1703, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1703, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1703, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1696, in fetch_url
    links = r.get_links(mode=mode)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/ansicat.py", line 528, in get_links
    self._build_body_and_links(mode)
  File "/usr/lib/python3.11/site-packages/ansicat.py", line 515, in _build_body_and_links
    abs_l = urllib.parse.urljoin(self.url,l.split()[0])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/parse.py", line 551, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/usr/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '7' does not appear to be an IPv4 or IPv6 address
Status
RESOLVED FIXED
Submitter
~julianmarcos
Assigned to
No-one
Submitted
1 year, 13 days ago
Updated
1 year, 11 days ago
Labels
No labels applied.

~lioploum 1 year, 13 days ago

Could you test it with trunk? I think I’ve fixed a similar crash in the upcoming 2.1

~julianmarcos 1 year, 12 days ago

On Thu, 30 Nov 2023 22:24:20 +0000 "~lioploum" outgoing@sr.ht wrote:

> Could you test it with trunk? I think I’ve fixed a similar crash in the upcoming 2.1

No, the problem still seems to occur in this test with trunk (c3aff6755e256fd977bd5073c5c2880f03d9c177):

https://tracker.debian.org[7]
Traceback (most recent call last):
  File "/usr/sbin/offpunk", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1897, in main
    gc.call_sync(refresh_time=refresh_time,depth=depth,lists=args.url)
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1747, in call_sync
    fetch_list(l,validity=refresh_time,depth=depth,tourchildren=True)
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1721, in fetch_list
    fetch_url(l,depth=depth,validity=validity,savetotour=tourchildren,count=[counter,end])
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1711, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1711, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1711, in fetch_url
    fetch_url(k,depth=d,validity=0,savetotour=savetotour,\
  [Previous line repeated 2 more times]
  File "/usr/lib/python3.11/site-packages/offpunk.py", line 1668, in fetch_url
    if not netcache.is_cache_valid(url,validity=validity):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/netcache.py", line 106, in is_cache_valid
    cache = get_cache_path(url)
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/netcache.py", line 135, in get_cache_path
    parsed = urllib.parse.urlparse(url)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/usr/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '7' does not appear to be an IPv4 or IPv6 address

(I've added a debugging call to fetch_url that prints the url every time, which is where the URL shown above the traceback comes from.)

Link extraction fails on the page gemini://rkta.srht.site/debbug-subscribe.gmi, because netcache seems to misinterpret this link:

=> https://tracker.debian.org [7] https://tracker.debian.org

The parsing itself is done by urllib rather than by offpunk. The failing call seems to be something like the following:

urllib.parse.urlparse("https://tracker.debian.org[7]")

fetch_url() receives the url as 'https://tracker.debian.org[7]', which I think is inherently incorrect. Is the page wrong?
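The failure can be reproduced outside offpunk. Below is a minimal sketch, assuming a Python version whose urllib validates bracketed hosts (the 3.11 in the tracebacks above does). `try_parse` is a hypothetical helper name, not part of offpunk.

```python
import urllib.parse


def try_parse(url):
    """Return the parsed URL, or None if urllib rejects it with ValueError."""
    try:
        return urllib.parse.urlparse(url)
    except ValueError as err:
        print(f"urllib rejected {url!r}: {err}")
        return None


# The well-formed link parses normally.
print(try_parse("https://tracker.debian.org"))

# The malformed one: urllib reads "[7]" as a bracketed IP literal in the host.
# On the Python 3.11 from the traceback this raises ValueError; older
# versions may accept the string without complaint.
print(try_parse("https://tracker.debian.org[7]"))
```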

Looking further at how it is called: line 1710 in offpunk.py calls fetch_url while running with mode == "links_only". That means that when AbstractRender().get_links(mode) is called (for the page that links to the malformed string), the function should return only the link itself, which should be https://tracker.debian.org.
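For reference, a gemtext link line consists of "=>", optional whitespace, the URL, and then an optional label. Here is a minimal sketch of extracting just the URL from the line quoted earlier. The function name `parse_link_line` is hypothetical and this is not ansicat's actual code.

```python
def parse_link_line(line):
    """Split a gemtext link line into (url, label); return None for other lines."""
    if not line.startswith("=>"):
        return None
    parts = line[2:].split(maxsplit=1)
    if not parts:
        return None  # "=>" with nothing after it
    url = parts[0]
    label = parts[1].strip() if len(parts) > 1 else ""
    return url, label


line = "=> https://tracker.debian.org [7] https://tracker.debian.org"
print(parse_link_line(line))
```

With this reading, the "[7]" belongs to the label and never reaches the URL.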

I guess fixing this means editing something between lines 497 and 523.

~lioploum 1 year, 12 days ago

That’s a very good catch. This is indeed a problem with that particular page.

Fixed a crash when parsing hidden_urls bug #32

GemtextRenderer parses the text for URLs outside of "=>" link lines and adds them to the list afterwards, to avoid having to copy/paste with the mouse. This is a hidden feature.

In this case, the string was not supposed to be a URL and included [] chars, which prevented urllib from knowing how to handle it.

The fix involved refactoring the looks_like_url function out of offpunk and adding it to offutils, so it can be used by ansicat to ensure a string looks_like_url before giving it to urllib.
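The idea of checking a candidate string before handing it to urllib can be sketched roughly as follows. This is an illustration only: `looks_like_url_sketch` and `collect_links` are hypothetical names, and the checks are assumptions rather than what the real looks_like_url in offutils does.

```python
import urllib.parse


def looks_like_url_sketch(word):
    """Rough check that `word` is an absolute URL urllib can parse.

    Hypothetical stand-in for illustration, not offpunk's looks_like_url.
    """
    try:
        parsed = urllib.parse.urlparse(word)
    except ValueError:
        # Newer Python versions reject malformed bracketed hosts here.
        return False
    host = parsed.netloc.rpartition("@")[2]
    # Brackets are only valid around an IP literal at the start of the host.
    if ("[" in host or "]" in host) and not host.startswith("["):
        return False
    return bool(parsed.scheme) and bool(parsed.netloc)


def collect_links(base_url, candidates):
    """Resolve only the candidates that pass the check against base_url."""
    links = []
    for word in candidates:
        if looks_like_url_sketch(word):
            links.append(urllib.parse.urljoin(base_url, word))
    return links


base = "gemini://rkta.srht.site/debbug-subscribe.gmi"
print(collect_links(base, ["https://tracker.debian.org",
                           "https://tracker.debian.org[7]"]))
```

Filtering before urljoin means a single odd string on a page is skipped instead of aborting the whole sync.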

!resolve FIXED

~lioploum referenced this from #32 1 year, 12 days ago

~lioploum REPORTED FIXED 1 year, 12 days ago

~julianmarcos 1 year, 11 days ago

Referencing the commit that fixes this for posterity: https://git.sr.ht/~lioploum/offpunk/commit/316465835217744f560fe2cd68bc457c1fc998d6 Commit id 316465835217744f560fe2cd68bc457c1fc998d6

Thanks, I have finally synced offpunk on my computer without it crashing. (Yes, I read your comments the moment you posted them, but I held off replying until offpunk finished so I wouldn't have to send another comment if there was a bug in the same area.) I have not had any other issues, thanks.

(Also, I guess I should have tried to fix the part that adds every fetched link to the tour, as offpunk has already put over 31000 links in there. I guess I accidentally almost made an archive of the World Wide Web. I should probably also request a feature to ignore certain links when fetching (like upload.wikimedia.org or en.wikipedia.org). And yes, I think I'm a bit insane using --depth 5 and linking to a capsule list. I guess I should probably email the user-discussion list to make the proposals.)

~lioploum 1 year, 11 days ago

any --depth over 1 is insane ;-)

there’s already the option to ignore http links, allowing you to cache the gemini world.

offpunk --sync --disable-http

or you can disable images:

offpunk --sync --images-mode None (that one is new and not really well tested)

But, indeed, it would be best discussed on the list.
