~luma_inhibitor/cables#84: 
Hydrus PTR integration

Integrate with the Hydrus network public tag repository to provide a file metadata lookup facility.

The Hydrus PTR API is undocumented, and it isn't obvious if accessing it via the Hydrus client is any different from using the client API. I've sent an email to the developer to see if they can shed some light on the matter.

I envision this ticket implementing the backend facility to carry out the API calls necessary to get tags back from a request containing a file checksum. Later, if possible, we can tie that in with media scanning/blacklisting, preview features, and search commands.

Status
REPORTED
Submitter
~erin
Assigned to
No-one
Submitted
2 years ago
Updated
2 years ago
Labels
feature/enhancement previews

~erin 2 years ago

The Hydrus developer responded to me. It's not possible to directly access the PTR server via an HTTPS REST API, but we can make the metadata requests we'd like for this through the client API. The Hydrus client, when fully sync'd with the PTR, exposes an interface for running these lookups against its own cached and indexed copy of PTR metadata in its local database.

It'd be worthwhile to test this out locally before trying to figure out how to get a local client running in the cloud. I'm unclear whether hydrus-client can even be run in headless mode, and there'd be other issues like disk access times and the sheer volume of file storage required. Presently, a fully up-to-date, indexed, and compacted DB is ~125GB, and it is nearly imperative that it reside on fast storage (SSD or better).

~erin 2 years ago*

This works, but the file has to be imported into the Hydrus client for it to expose all metadata through the API. Without doing this, the PTR tags will be available, but not dimensions, mimetype, etc.

Import file:

curl \
    -s -k -X POST \
    -H "Hydrus-Client-API-Access-Key: $API_KEY" \
    -H 'Content-Type: application/json' \
    -d '{"path": "/path/to/file.jpg"}' \
    https://127.0.0.1:45869/add_files/add_file \
    | jq '.'

Note: if the file cannot be imported, the API will respond with status code 4 and a traceback in the note field:

{
    "status": 4,
    "hash": "sha256-here",
    "note": "Traceback (most recent call last):\n  File \"/opt/hydrus/hydrus/client/networking/ClientLocalServerResources.py\", line 1079, in _threadDoPOSTJob\n    file_import_status = file_import_job.DoWork()\n  File \"/opt/hydrus/hydrus/client/importing/ClientImportFiles.py\", line 148, in DoWork\n    self.GenerateInfo( status_hook = status_hook )\n  File \"/opt/hydrus/hydrus/client/importing/ClientImportFiles.py\", line 287, in GenerateInfo\n    self._file_info = HydrusFileHandling.GetFileInfo( self._temp_path, mime = mime )\n  File \"/opt/hydrus/hydrus/core/HydrusFileHandling.py\", line 208, in GetFileInfo\n    raise HydrusExceptions.UnsupportedFileException( 'Unknown filetype!' )\nhydrus.core.HydrusExceptions.UnsupportedFileException: Unknown filetype!\n"
}

When imported successfully, status code 1 will be returned:

{
    "status": 1,
    "hash": "sha256-here",
    "note": ""
}

Get file metadata:

curl \
    -s -k -X GET \
    -H "Hydrus-Client-API-Access-Key: $API_KEY" \
    "https://127.0.0.1:45869/get_files/file_metadata?hashes=%5B%22${FILE_SHA}%22%5D" \
    | jq '.'
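
The hashes query parameter is just the JSON array ["<sha256>"], percent-encoded. A tiny helper (the function name is mine) makes the encoding explicit:

```shell
# Percent-encode a single sha256 digest as the JSON array ["<hash>"],
# i.e. %5B%22<hash>%22%5D, for the hashes= query parameter above.
encode_hashes() {
    printf '%%5B%%22%s%%22%%5D' "$1"
}
```

Then the URL can be built as `"https://127.0.0.1:45869/get_files/file_metadata?hashes=$(encode_hashes "$FILE_SHA")"`.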

Parse metadata:

echo "$CURL_RESPONSE" | jq -r '
    .metadata | .[0] | {
        "mimetype": .mime,
        "width": .width,
        "height": .height,
        "duration": .duration,
        "has audio": .has_audio,
        "tags": .service_names_to_statuses_to_display_tags | ."public tag repository" | ."0"
    }'
{
  "mimetype": "image/jpeg",
  "width": 423,
  "height": 810,
  "duration": null,
  "has audio": false,
  "tags": [
    "gender:shemale",
    "rating:explicit"
  ]
}

For later implementation, the known_urls field may be interesting for providing a command that is essentially a reverse image search.
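
For instance, assuming the response shape above (the helper name is mine), the URLs could be pulled out with:

```shell
# Extract each known URL from the first file's metadata; falls back to an
# empty list when the known_urls field is absent.
known_urls() {
    jq -r '.metadata[0].known_urls // [] | .[]'
}
```

Usage: `echo "$CURL_RESPONSE" | known_urls`.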

~erin 2 years ago

I've emailed the Hydrus developer back with some next-step questions about running the client on a cloud server. There's indeed no headless mode, but they do provide a Docker container with VNC baked in, without requiring an entire X11 server to be running. Unfortunately, I'll have to make use of this to do the initial configuration.

Looking over my local Hydrus client DB's files, I've determined we won't need as much space in production as I first thought. My own sqlite DB only takes up ~70GB, and that includes my own tags and file metadata (though that amount would be minimal compared to the PTR).

Some tasks that will need to be performed that I can presently predict:

  1. Determine where we ought to run the bot in production (AWS or a cheaper host), given the additional storage requirements
  2. Extend the cloud instance's secondary storage to afford space for the additional ~70GB
  3. Transfer an export of PTR update files from my local workstation to the instance
  4. Extend the bot's Docker compose file to launch the modified local Hydrus client Docker container
  5. Launch the Hydrus client's Docker container through Docker compose in a cloud development environment
  6. Import PTR updates into the client's sqlite database and run the lengthy, compute-intensive indexing
  7. Connect to the Hydrus client container's VNC server for remote GUI access
  8. Enable client API (HTTPS not required since the bot will be accessing it from the same Docker network)
  9. Create a client API key for the bot to use, granting 'import files' and 'get metadata' permissions
  10. Extend the bot's Docker compose file and Dockerfile to pass in the Hydrus client API key to the application
  11. Reconfigure the cloud bot development environment configuration with the Hydrus client API key value
  12. Extend the bot application to optionally load a new cog for this feature set
  13. Program this new cog to make use of the Hydrus client API key, and provide internal interfaces to make the add file and get metadata API calls

An alternative to doing PTR update import and processing on the cloud instance would be to:

  1. Download the PTR quicksync package locally (it has the PTR mappings already indexed, up to 2021-02)
  2. Export my up-to-date Hydrus client's PTR update files to fast storage
  3. Point my local Hydrus client installation to the quicksync DB, and launch the application
  4. Import only the PTR updates dated after the most recently indexed PTR mappings present in the DB
  5. Allow the client to process PTR updates and perform index operations for the ~year it is missing
  6. Perform a compact DB maintenance operation from the Hydrus client
  7. Archive the resulting DB files on disk
  8. Transfer the archive to the client instance and extract it to the secondary media storage
  9. Specify the path to this DB as the Docker container's DB volume
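
Steps 7–8 of this alternative amount to something like the following sketch (paths are placeholders; the tarball would move between hosts via rsync or scp):

```shell
# SRC: local client DB directory; DEST: the instance's secondary media
# storage. Both defaults are placeholders.
SRC=${SRC:-$HOME/hydrus/db}
DEST=${DEST:-/mnt/media/hydrus-db}

# Archive the compacted DB files into the working directory.
archive_db() {
    tar -C "$SRC" -czf hydrus-db.tar.gz .
}

# Extract the archive onto the secondary storage.
restore_db() {
    mkdir -p "$DEST" && tar -C "$DEST" -xzf hydrus-db.tar.gz
}

# Transfer between hosts, e.g.:
#   rsync -avP hydrus-db.tar.gz user@instance:/mnt/media/
```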