~amolith/fediring#491: 
Lieu search engine is aggressively spidering my site.

Hello. My website (https://drwho.virtadpt.net/) is part of the Fediring. Earlier this week I noticed that one particular IP address (5.161.53.68 - fediring.net) is the source for about 63.7% of all of the web traffic on my site in the last 31 days.
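For anyone wanting to check a figure like this on their own server, here is a rough sketch of how a per-IP traffic share could be computed. It assumes an access log whose first whitespace-separated field is the client IP (as in the common/combined log formats); the function name `traffic_share` and the sample lines are made up for illustration and are not taken from the actual logs.

```python
from collections import Counter


def traffic_share(log_lines):
    """Return {client_ip: percent_of_requests} for the given access-log lines.

    Assumption: the client IP is the first whitespace-separated field.
    """
    hits = Counter(
        line.split(maxsplit=1)[0] for line in log_lines if line.strip()
    )
    total = sum(hits.values())
    if total == 0:
        return {}
    return {ip: 100.0 * count / total for ip, count in hits.items()}


# Illustrative sample data only; real logs would be read from a file.
sample = [
    '5.161.53.68 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '5.161.53.68 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 512',
    '5.161.53.68 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 200 512',
    '203.0.113.7 - - [01/Jan/2024:00:00:04 +0000] "GET / HTTP/1.1" 200 512',
]
shares = traffic_share(sample)
```

This counts requests rather than bytes, so it only approximates "share of traffic" if responses are of similar size.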

I'm flattered that somebody's indexing me. However, that seems a little excessive. Is Lieu supposed to be that aggressive when it spiders sites? Or is my site just that big? I don't have any experience with it so I don't know if that's normal or not. Can you please advise?

Status: RESOLVED FIXED
Submitter: ~drwho
Assigned to: No-one
Submitted: 10 months ago
Updated: 9 months ago
Labels: No labels applied.

~amolith 10 months ago

Hello o/

It's late here; please forgive my brief reply.

Lieu was set to crawl the entire fediring daily, and in hindsight, that's excessive. I've set it to monthly for now and will re-evaluate/reply tomorrow afternoon or evening.

Cheers

~amolith 10 months ago

~jbauer please feel free to weigh in if you have any thoughts

~jbauer 10 months ago

Reducing crawling frequency would be good. I do think once per day is a bit excessive for the size and frequency of updates of most sites on the ring, so we should go with something along the lines of once every two weeks or once every month (depending on what Lieu allows)

~amolith 10 months ago*

> we should go with something along the lines of once every two weeks or once every month (depending on what Lieu allows)

Crawling and ingesting are just two commands that execute in sequence in a cronjob, so any cron expression is fine. It had just been @daily, but is now @monthly. There's a thread on fedi about it where the Lieu creator says they crawl the xxiivv ring manually every 3-4 months; based on that, I think leaving it at monthly sounds fine for now. It's not 3-4 months, but it's still a ~30x reduction in traffic ^^'
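As a quick sanity check on the "~30x" figure, the arithmetic can be sketched like this. It assumes a 365-day year and that every crawl generates roughly the same amount of traffic; the names `RUNS_PER_YEAR` and `reduction_factor` are only for illustration.

```python
# Runs per year for each cron shorthand, assuming a 365-day year.
RUNS_PER_YEAR = {"@daily": 365, "@monthly": 12}


def reduction_factor(old, new):
    """How many times fewer runs `new` schedules compared with `old`."""
    return RUNS_PER_YEAR[old] / RUNS_PER_YEAR[new]


factor = reduction_factor("@daily", "@monthly")
print(round(factor, 1))  # 30.4
```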

What do you think ~drwho?

~drwho 10 months ago

I think monthly recrawls make sense, ~amolith. Most sites don't change all that frequently, and the ones that do tend to post their links before anyone needs to search for them.

Is this a monthly full re-crawl, or is there a way to optimize it? Say, by paying attention to If-Modified-Since headers or HTTP 304 status returns?
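To make the idea concrete, here is a minimal sketch of a conditional GET using only the Python standard library. It is not how Lieu works (nothing in this thread suggests Lieu does this); `conditional_fetch` and its parameters are hypothetical names. The `opener` argument is there only so the function can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def conditional_fetch(url, last_modified=None, etag=None,
                      opener=urllib.request.urlopen):
    """Fetch `url` unless the server reports it unchanged.

    Returns (body, validators). `body` is None when the server answers
    304 Not Modified; `validators` holds the values to send next time.
    """
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            validators = {
                "last_modified": response.headers.get("Last-Modified"),
                "etag": response.headers.get("ETag"),
            }
            return response.read(), validators
    except urllib.error.HTTPError as err:
        # urllib surfaces 304 as an HTTPError; treat it as "unchanged"
        # and keep the validators we already had.
        if err.code == 304:
            return None, {"last_modified": last_modified, "etag": etag}
        raise
```

A crawler using something like this would store the returned validators per URL after each crawl and skip re-downloading and re-parsing any page for which `body` comes back as `None`. It only helps for servers that actually send `Last-Modified` or `ETag` headers.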

~amolith 10 months ago

As far as I'm aware, it's a full re-crawl each time and there are no optimisations. Lieu's config docs don't really mention anything about traffic, mainly just general config and improving ingest heuristics for better results.

~jbauer REPORTED FIXED 9 months ago

Sounds like this problem has been solved! I'll go ahead and close this issue now.
