If you read enough EXPLAIN ANALYZE output from PostgreSQL, you start to get some ideas…

PostgreSQL estimates the cost and row counts of a query before it dives into executing it. The same concept could be applied to Lemmy sort orders and page numbers.
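For example, EXPLAIN (ANALYZE) prints the planner’s estimated row counts right next to the actual row counts for each step. A minimal probe against post_aggregates - the LIMIT/OFFSET values here are just a hypothetical “page 2” fetch, not Lemmy’s real query:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM post_aggregates
ORDER BY hot_rank DESC, published DESC
LIMIT 10 OFFSET 10;
```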
The whole project seems stuck on fringe cases. The reality is that the servers are running slow and overloading on routine fetches of Hot and Active post listings at page 1, 2, or 3. Sure, some of the apps have fast scrolling to reach page 5 quickly… but there is far too much concern out there over how page 970 behaves with a 5-month-old posting… while the servers are crashing. Did I mention that the servers have been crashing nonstop since I came around in May?
Today I’m going to experiment with doing what PostgreSQL does internally before picking an INDEX or a TABLE scan… giving the query some hints about where to go. For every sort order, there is a natural way to approach routine, conventional paging… the trick is how to apply that to both a single community and a blend of many communities.
Brainstorming…

Every combination of community block and subscribe is too much to precompute…
But can we leverage community_aggregates? Given all the overhead of updating it on every post++ and comment++… can we also write a hint of where page 20 of each sort may fall?
Each community could set a dirty flag on a community_aggregates update. The count alone can’t be relied upon, because it also decrements.
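A sketch of what that could look like, assuming community_aggregates keeps its posts and comments counters - the column names (paging_dirty, hot_page_20_published) and the trigger are hypothetical, not the existing Lemmy schema:

```sql
-- Hypothetical additions to community_aggregates for paging hints.
ALTER TABLE community_aggregates
    ADD COLUMN paging_dirty boolean NOT NULL DEFAULT true,
    ADD COLUMN hot_page_20_published timestamp;  -- roughly where page 20 of Hot falls

-- Mark the community dirty whenever its post/comment counts change;
-- a flag works where the raw counts don't, since counts go both up and down.
CREATE FUNCTION community_aggregates_mark_dirty() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.paging_dirty := true;
    RETURN NEW;
END;
$$;

CREATE TRIGGER community_aggregates_dirty
    BEFORE UPDATE OF posts, comments ON community_aggregates
    FOR EACH ROW EXECUTE FUNCTION community_aggregates_mark_dirty();
```

A background job could then recompute hot_page_20_published only for rows with paging_dirty = true, and clear the flag when done.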
The hierarchy is important to consider. Every comment is owned by a post, every post is owned by a community, and every community is owned by an instance. For Reddit, the user is secondary: their comments and posts traditionally can be linked to and read even if the user deletes their account. For Lemmy, removal of content was an early design priority, but I think that harms search-engine users and goes against the spirit of what a public forum is about (yes, everyone wants to turn it into a private playland, ignoring that hackers break into major websites with full-time security teams - and Lemmy servers being operated by unknown strangers isn’t exactly something they should count on).
Ok, back on topic. With all the overhead currently in place to update community_aggregates on each new comment and post, it makes sense to try to filter out the cruft of 5,000 communities that may be stale yet bog down every PostgreSQL query for posts from the most recent hour. I don’t think I’ve seen a single mention by anyone analyzing the situation that lemmy.world having over 9,000 local communities is a huge factor - Lemmy was more or less built around “20 communities” back in 2020.
Brainstorming…

We know the sort order and the page number we are loading before we start building any SQL statement at all. Regardless of any customized user preferences, these are foundational in terms of filtering out data.
For /c/All, the page number is useful. It is when you get into /c/Subscribed, with a blend of communities, that the trouble begins.
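For /c/All, here is a sketch of what a page-number hint could look like: since hot_rank decays to 0 within days (more on that below), a low page number implies a recent published cutoff. The 2-week interval and the page size of 10 are assumptions for illustration, not tuned values:

```sql
-- Page 2 of Hot on /c/All: bound the scan before sorting, so PostgreSQL
-- never touches rows that cannot possibly appear on an early page.
SELECT p.id
FROM post_aggregates p
WHERE p.published > timezone('utc', now()) - interval '2 weeks'  -- assumption: hot_rank has floored to 0 by here
  AND p.hot_rank != 0
ORDER BY p.hot_rank DESC, p.published DESC
LIMIT 10 OFFSET 10;
```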
There is also the behavior people might expect from “Top 1 hour” when they run out of posts to read. On that specific choice, they are likely to accept it… but with Hot and Active…
So what exactly are Hot and Active… and how far back in time do they go?
```
hot_rank: 1728,
hot_rank_active: 1728,
-- Note: 1728 is the result of the hot_rank function, with a score of 1, posted now
-- hot_rank = 10000 * log10(1 + 3) / power(2, 1.8)
```
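How far back in time do they go? Plugging ages into that same hot_rank() function (shown in full later in this post) gives a rough answer for a score of 1:

```sql
SELECT hot_rank(1, timezone('utc', now()) - interval '1 minute');  -- 1703, near the 1728 ceiling
SELECT hot_rank(1, timezone('utc', now()) - interval '5 days');    -- 1, about to hit the floor
SELECT hot_rank(1, timezone('utc', now()) - interval '7 days');    -- 0, gone from Hot entirely
```

So with a score of 1, a post falls out of Hot after roughly 5 days; higher scores stretch that window out, but only to a few weeks at most.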
For post listings, these are the three sort orders that aren’t simple to grasp:
```rust
SortType::Active => query
    .then_order_by(post_aggregates::hot_rank_active.desc())
    .then_order_by(post_aggregates::published.desc()),
SortType::Hot => query
    .then_order_by(post_aggregates::hot_rank.desc())
    .then_order_by(post_aggregates::published.desc()),
SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
```
And here is the scheduled batch job that recalculates those ranks, captured in the PostgreSQL log:

```
lemmy@lemmy_alpha LOG:  duration: 50.220 ms  execute <unnamed>:
    WITH batch AS (SELECT a.id
                   FROM post_aggregates a
                   WHERE a.published > $1
                     AND (a.hot_rank != 0 OR a.hot_rank_active != 0)
                   ORDER BY a.published
                   LIMIT $2
                   FOR UPDATE SKIP LOCKED)
    UPDATE post_aggregates a
    SET hot_rank = hot_rank(a.score, a.published),
        hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)
    FROM batch
    WHERE a.id = batch.id
    RETURNING a.published;
2023-08-18 09:00:34.578 MST [1877420] lemmy@lemmy_alpha DETAIL:  parameters: $1 = '2023-08-16 23:40:31.149267', $2 = '1000'
```
```sql
CREATE FUNCTION public.hot_rank(score numeric, published timestamp without time zone)
    RETURNS integer
    LANGUAGE plpgsql
    IMMUTABLE PARALLEL SAFE
AS $$
DECLARE
    hours_diff numeric := EXTRACT(EPOCH FROM (timezone('utc', now()) - published)) / 3600;
BEGIN
    IF (hours_diff > 0) THEN
        RETURN floor(10000 * log(greatest(1, score + 3)) / power((hours_diff + 2), 1.8))::integer;
    ELSE
        RETURN 0;
    END IF;
END;
$$;
```
Difference between hot_rank and hot_rank_active: it is the same formula, fed two different timestamps.

```sql
SET hot_rank = hot_rank(a.score, a.published),
    hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)
```

Hot decays from when the post was published; Active decays from the newest comment time (newest_comment_time_necro, which judging by the name has necro-bump protection).
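So a stale post that just received a comment scores high on Active while its Hot rank has already decayed away. A quick illustration, assuming the hot_rank() function above is installed - the score of 10 and the ages are arbitrary:

```sql
-- Same formula, different clock: a 3-day-old post that got a comment 5 minutes ago.
SELECT hot_rank(10, timezone('utc', now()) - interval '3 days')    AS hot,     -- low single digits
       hot_rank(10, timezone('utc', now()) - interval '5 minutes') AS active;  -- in the thousands
```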