Integrate with the Hydrus network public tag repository to provide a file metadata lookup facility.
The Hydrus PTR API is undocumented, and it isn't obvious if accessing it via the Hydrus client is any different from using the client API. I've sent an email to the developer to see if they can shed some light on the matter.
I envision this ticket implementing the backend facility to carry out the API calls necessary to get tags back from a request with a file checksum. Later, if this is possible, we can tie that in with media scanning/blacklisting, preview features and search commands.
The Hydrus developer responded to me. It's not possible to directly access the PTR server via HTTPS REST API, but we are able to make the requests for metadata that we'd like for this through the client API. The Hydrus client, when fully sync'd with the PTR, exposes an interface for making these lookups in its own cached and indexed copy of PTR metadata in its local database.
It'd be worthwhile to test this out locally first before trying to figure out how to get a local client running in the cloud. I'm unclear whether `hydrus-client` can even be run in headless mode, and there'd be other issues like disk access times and the sheer volume of file storage required. Presently, a fully up-to-date, indexed, and compacted DB is ~125GB, and it is nearly imperative that it reside on fast storage (SSD or better).
This works, but the file has to be imported into the Hydrus client for it to expose all metadata through the API. Without doing this, the PTR tags will be available, but not dimensions, mimetype, etc.
Import file:
```shell
curl \
  -s -k -X POST \
  -H "Hydrus-Client-API-Access-Key: $API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"path": "/path/to/file.jpg"}' \
  https://127.0.0.1:45869/add_files/add_file \
  | jq '.'
```

Note, if the file cannot be imported, the API will respond with `status` code 4, and a traceback in the `note` field:

```json
{
  "status": 4,
  "hash": "sha256-here",
  "note": "Traceback (most recent call last):\n File \"/opt/hydrus/hydrus/client/networking/ClientLocalServerResources.py\", line 1079, in _threadDoPOSTJob\n file_import_status = file_import_job.DoWork()\n File \"/opt/hydrus/hydrus/client/importing/ClientImportFiles.py\", line 148, in DoWork\n self.GenerateInfo( status_hook = status_hook )\n File \"/opt/hydrus/hydrus/client/importing/ClientImportFiles.py\", line 287, in GenerateInfo\n self._file_info = HydrusFileHandling.GetFileInfo( self._temp_path, mime = mime )\n File \"/opt/hydrus/hydrus/core/HydrusFileHandling.py\", line 208, in GetFileInfo\n raise HydrusExceptions.UnsupportedFileException( 'Unknown filetype!' )\nhydrus.core.HydrusExceptions.UnsupportedFileException: Unknown filetype!\n"
}
```

When imported successfully, `status` code 1 will be returned:

```json
{
  "status": 1,
  "hash": "sha256-here",
  "note": ""
}
```

Get file metadata:

```shell
curl \
  -s -k -X GET \
  -H "Hydrus-Client-API-Access-Key: $API_KEY" \
  "https://127.0.0.1:45869/get_files/file_metadata?hashes=%5B%22${FILE_SHA}%22%5D" \
  | jq '.'
```

Parse metadata:

```shell
echo "$CURL_RESPONSE" | jq -r '
  .metadata
  | .[0]
  | {
      "mimetype": .mime,
      "width": .width,
      "height": .height,
      "duration": .duration,
      "has audio": .has_audio,
      "tags": .service_names_to_statuses_to_display_tags | ."public tag repository" | ."0"
    }'
```

```json
{
  "mimetype": "image/jpeg",
  "width": 423,
  "height": 810,
  "duration": null,
  "has audio": false,
  "tags": [
    "gender:shemale",
    "rating:explicit"
  ]
}
```

For later implementation, the `known_urls` field may be interesting for providing a command that is essentially a reverse image search.
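The two calls above could be wrapped into a small shell helper as a starting point for the backend facility. This is only a sketch: the function names are hypothetical, and it assumes `$API_KEY` is exported and the client API is listening on the default port.

```shell
#!/bin/sh
# Base URL is an assumption; override via the environment if needed.
HYDRUS_API="${HYDRUS_API:-https://127.0.0.1:45869}"

# Percent-encode a single sha256 hash into the JSON-array query
# parameter the client API expects: ["<hash>"] -> %5B%22<hash>%22%5D
hashes_param() {
    printf '%%5B%%22%s%%22%%5D' "$1"
}

# Import a file by path on the client's filesystem (hypothetical wrapper)
hydrus_import_file() {
    curl -s -k -X POST \
        -H "Hydrus-Client-API-Access-Key: $API_KEY" \
        -H 'Content-Type: application/json' \
        -d "{\"path\": \"$1\"}" \
        "$HYDRUS_API/add_files/add_file"
}

# Fetch metadata for a single sha256 hash
hydrus_file_metadata() {
    curl -s -k \
        -H "Hydrus-Client-API-Access-Key: $API_KEY" \
        "$HYDRUS_API/get_files/file_metadata?hashes=$(hashes_param "$1")"
}
```

Usage would be `hydrus_import_file /path/to/file.jpg` followed by `hydrus_file_metadata "$FILE_SHA" | jq '.'`.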
I've emailed the Hydrus developer back with some next step questions about running the client on a cloud server. There's indeed no headless mode that it can run in, but they do provide a Docker container that has VNC baked in without requiring an entire X11 server to be running. Unfortunately, I will have to make use of this to do initial configuration.
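For local experimentation, launching the containerised client might look roughly like this. The image name, exposed ports, and DB volume path are assumptions to be verified against the Hydrus Docker documentation, not confirmed values.

```shell
# Sketch only: image tag, ports (5900 for the baked-in VNC, 45869 for
# the client API), and the /opt/hydrus/db volume path are assumptions.
run_hydrus_container() {
    docker run -d --name hydrus-client \
        -v "$1:/opt/hydrus/db" \
        -p 5900:5900 \
        -p 45869:45869 \
        ghcr.io/hydrusnetwork/hydrus:latest
}
```

Something like `run_hydrus_container /mnt/media/hydrus-db`, then pointing a VNC viewer at port 5900 for the initial configuration.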
Looking over my local Hydrus client DB's files, I've determined we won't need as much space in production as I first thought. My own `sqlite` DB only takes up ~70GB, and that includes my own tags and file metadata (albeit that amount would be minimal compared to the PTR).

Some tasks that will need to be performed that I can presently predict:
- Determine where we ought to run the bot in production - from AWS or a cheaper hoster, given additional storage requirements
- Extend the cloud instance's secondary storage to afford space for the additional ~70GB
- Transfer an export of PTR update files from my local workstation to the instance
- Extend the bot's Docker compose file to launch the modified local Hydrus client Docker container
- Launch the Hydrus client's Docker container through Docker compose in a cloud development environment
- Import and lengthy, compute-intensive indexing of PTR updates into the client's `sqlite` database
- Connect to the Hydrus client container's VNC server for remote GUI access
- Enable client API (HTTPS not required since the bot will be accessing it from the same Docker network)
- Create a client API key for the bot to use, granting 'import files' and 'get metadata' permissions
- Extend the bot's Docker compose file and `Dockerfile` to pass in the Hydrus client API key to the application
- Reconfigure the cloud bot development environment configuration with the Hydrus client API key value
- Extend the bot application to optionally load a new cog for this feature set
- Program this new cog to make use of the Hydrus client API key, and provide internal interfaces to make the add file and get metadata API calls
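Once the client container is on the bot's Docker network, a quick smoke test of the connection and API key could look like the following. The compose service name `hydrus` and the helper names are assumptions; `/api_version` and `/verify_access_key` are existing client API endpoints. Plain HTTP is used since, per the list above, TLS isn't required inside the Docker network.

```shell
# Build the in-network base URL from the (assumed) compose service name.
hydrus_base_url() {
    printf 'http://%s:45869' "${1:-hydrus}"
}

# Hit an unauthenticated endpoint, then verify the key's permissions.
hydrus_smoke_test() {
    base="$(hydrus_base_url "$1")"
    curl -s "$base/api_version"
    curl -s -H "Hydrus-Client-API-Access-Key: $API_KEY" \
        "$base/verify_access_key"
}
```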
An alternative to doing PTR update import and processing on the cloud instance would be to:
- Download the PTR quicksync package locally (it has the PTR mappings already indexed, up to 2021-02)
- Export my up-to-date Hydrus client's PTR update files to fast storage
- Point my local Hydrus client installation to the quicksync DB, and launch the application
- Import only the PTR updates dated after the most recently indexed PTR mappings present in the DB
- Allow the client to do processing of PTR updates and index operations for the ~year it is missing
- Perform a compact DB maintenance operation from the Hydrus client
- Archive the resulting DB files on disk
- Transfer the archive to the client instance and extract it to the secondary media storage
- Specify the path to this DB as the Docker container's DB volume
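The archive-and-transfer tail of that plan might be sketched as below. All paths, the `cloud-host` name, and the choice of zstd for compression are illustrative assumptions, not settled decisions.

```shell
# Date-stamped archive name for the compacted DB
archive_name() {
    printf 'hydrus-db-%s.tar.zst' "$(date +%Y%m%d)"
}

# Archive the compacted DB directory; zstd keeps a ~70GB tree tolerable
archive_db() {
    tar -C "$1" -cf - . | zstd -T0 -o "$2"
}

# Copy the archive to the instance and extract it onto the secondary
# media storage (hypothetical host and mount point).
transfer_db() {
    scp "$2" "$1:/mnt/media/" &&
    ssh "$1" "mkdir -p /mnt/media/hydrus-db && \
        zstd -dc /mnt/media/$2 | tar -C /mnt/media/hydrus-db -xf -"
}
```

Roughly: `archive_db ~/hydrus/db "$(archive_name)"`, then `transfer_db cloud-host "$(archive_name)"`.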