~whereswaldon/arbor-dev#70: 
Improve filesystem layout of the Grove type to make Store operations faster

The Grove type implements the Store interface in the most naive possible way: it stores all nodes in a single flat directory. This works, but makes all of the following operations inefficient:

  • GetIdentity: must scan every node in the directory (it can't tell which ones are identities)
  • GetCommunity: see above
  • GetConversation: see above
  • GetReply: see above
  • Children: the flat layout doesn't encode the tree relationships between nodes, so every node on disk is conceivably a child of any other node and each one must be checked
  • Recent: must scan every node even though it only cares about a single type of node

The above could be addressed by storing nodes in a more structured hierarchy. Something like:

.
├── communities
│   └── SHA512_B32__QYRdxHQOnLTr_SD0u8nPIjAYi1YODWH05tlSDR9dAdQ
│       ├── self
│       ├── SHA512_B32__lVTlE-WJUJxtUtWPfafnudQk9oyHT2pgWEtBZXFjx4o
│       │   ├── self
│       │   ├── SHA512_B32__z89tiqtAmXWfN5l7-L18hFIQ4rXl_Rp4i98Cjyuj9xE
│       │   └── SHA512_B32__ZI9oIOdcUhV6uCF3wQFU4_XsJZmEaN_7TqWLn-Ob0TQ
│       └── SHA512_B32__z3n44lYtegNanse1iQTLbBG7r7cn9mjGjND17-6WdO0
│           ├── self
│           ├── SHA512_B32__zRHH06sN9QjEhAX4oZsp6jDubSKxXH-DXiqPoWHVRpk
│           └── SHA512_B32__zudIQfF_3pb3NyXJoOZjvTWP4TxAqa70OMkW3A_ErTE
├── .grove
│   └── version
└── identities
    ├── SHA512_B32__rs8L0sAjH2fE9anr5_Bu7Ijv6Cyig-H8N_WcoUcB3EQ
    └── SHA512_B32__uE9Nm7kZvJVJ1zdZYbsBsrn0BNX6rPhV_MQWape8Acw

All root nodes like identities and communities can be iterated quickly, since each kind lives in its own top-level directory. Each community's data is stored in the self file in the directory named for the community ID. The directories inside of a community are conversations (top-level replies), and their data is in their own self files. Within each conversation directory, all of the nodes that belong to that conversation are stored as flat files.
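As a rough sketch of how lookups might work under this layout, the Go snippet below maps node IDs to paths and lists the entries of a directory. All helper names (identityPath, communityPath, conversationPath, replyPath, childIDs) are hypothetical and not part of forest-go, and it assumes callers already know which community and conversation a node belongs to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// selfFile is the file name the proposed layout uses for a directory's own node.
const selfFile = "self"

// identityPath returns where an identity node would be stored.
func identityPath(root, identityID string) string {
	return filepath.Join(root, "identities", identityID)
}

// communityPath returns the file holding a community node's data.
func communityPath(root, communityID string) string {
	return filepath.Join(root, "communities", communityID, selfFile)
}

// conversationPath returns the file holding a conversation (top-level reply).
func conversationPath(root, communityID, conversationID string) string {
	return filepath.Join(root, "communities", communityID, conversationID, selfFile)
}

// replyPath returns the file holding a reply stored flat inside a conversation.
func replyPath(root, communityID, conversationID, replyID string) string {
	return filepath.Join(root, "communities", communityID, conversationID, replyID)
}

// childIDs lists every entry of dir except the "self" file.
// For a community directory this yields its conversations; for a
// conversation directory it yields every reply stored in it, which are
// not necessarily direct children of the conversation's root reply.
func childIDs(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, e := range entries {
		if e.Name() != selfFile {
			ids = append(ids, e.Name())
		}
	}
	return ids, nil
}

func main() {
	community := "SHA512_B32__QYRdxHQOnLTr_SD0u8nPIjAYi1YODWH05tlSDR9dAdQ"
	conversation := "SHA512_B32__lVTlE-WJUJxtUtWPfafnudQk9oyHT2pgWEtBZXFjx4o"
	fmt.Println(conversationPath("grove", community, conversation))
}
```

With helpers like these, listing identities or a community's conversations becomes a single directory read. Finding the direct children of a reply deep inside a conversation would still mean checking each node in that conversation, but no longer every node in the grove.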

I think a structure like this would address most of the performance issues that we face now. The .grove directory (like git's .git folder) is for us to store information like version numbers. This will allow us to iterate on how nodes are stored in a backwards-compatible way. The example file .grove/version would tell the Grove instance that opened it which schema was in use.
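To illustrate the versioning idea, here is a minimal sketch of how a Grove might read .grove/version when it opens a directory. The file format (a single decimal integer), the choice to treat a missing file as version 0 (the current flat layout), and the function names parseVersion and readVersion are all assumptions made for this example, not existing forest-go behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseVersion interprets the contents of the version file.
// Assumption: the file holds one non-negative decimal integer.
func parseVersion(contents string) (int, error) {
	v, err := strconv.Atoi(strings.TrimSpace(contents))
	if err != nil {
		return 0, fmt.Errorf("malformed grove version %q: %w", contents, err)
	}
	if v < 0 {
		return 0, fmt.Errorf("negative grove version %d", v)
	}
	return v, nil
}

// readVersion returns the schema version recorded under root.
// Assumption: a missing version file means the legacy flat layout (version 0).
func readVersion(root string) (int, error) {
	data, err := os.ReadFile(filepath.Join(root, ".grove", "version"))
	if errors.Is(err, os.ErrNotExist) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return parseVersion(string(data))
}

func main() {
	v, err := readVersion(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("grove schema version:", v)
}
```

A Grove that opened the directory could then switch on the returned version to pick the matching layout, or refuse to open a version newer than it understands.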

Status: REPORTED
Submitter: ~whereswaldon
Assigned to: No-one
Submitted: 10 months ago
Updated: 10 months ago
Labels: feature, forest-go, help-wanted

~whereswaldon 10 months ago

I'm happy to discuss alternative filesystem layouts though, if anyone has ideas that they would like to explore.
