What I want for a Queue System

I've been interested in queue systems since first learning about and using RabbitMQ in ... 2009 (woah... it's been a minute). What I've learned along the way is that:

  • People won't care as much as I do
  • You can't make people care
  • If they don't care, it's even easier to make mistakes

Redis is quite popular as a queue system, and I've joined multiple companies/teams where Redis and Python-RQ were used for async tasks. Redis is wonderful and a great solution to many problems (including async tasks!), but in the cases I've seen it was mostly an incomplete solution, applied improperly.

Google Pub/Sub is pretty wonderful generally, and the pattern I love most is combining Pub/Sub and Cloud Run for HTTP delivery of events. There are some limitations to this pattern, but above all it removes many of the problems developers can cause for themselves.

Event -> HTTP -> Service makes handling events much easier.

  • It's difficult to run tasks for hours from a single HTTP request
  • Handling of events requires little knowledge of a particular library
  • Much of the complexity doesn't need to be in the app
  • Removing the complexity from the app makes it easier for more apps to use it, without lots of work

I can't run Pub/Sub and Cloud Run at my house though.

What I want from a queue system

  • HTTP and/or GRPC submission of events to the queue system
  • HTTP and/or GRPC push to a service
  • Possible to run in a home environment, but not the-worst-idea in a larger environment
  • Back-pressure. When too many events are in the system the publishing will slow down.
  • Easy to run and not worry about it
  • Small idle footprint in memory/CPU
  • Horizontal scalability. If it's ever used in production somewhere, adding capacity should be easy.

What I don't need from a queue system

  • Super high throughput. 10k events per second is wonderful... but if it can do 100 and scales out, I'm not too worried.
  • Perfect durability. I'll assume that at some point data might be lost and those outliers are OK.
  • Perfect deliverability. I normally add end-to-end checks so that a dropped event will only cause a delay, not a consistency problem.

What to do about it

I haven't found exactly what I'm looking for in other systems. Since I'm mostly scratching a self-hosting itch at the moment, I'm looking to throw together a sample system to solve my problem, never expecting it to go beyond that (although maybe it will be useful for someone else?).

As I learn Rust, connecting Axum, Rust channels and Reqwest should get me pretty far.
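
To make that concrete, here's a minimal sketch (assuming roughly axum 0.7, tokio, and reqwest as dependencies; the /events route, the queue depth, and the downstream URL are placeholders I made up). A bounded channel between the Axum handler and a Reqwest worker covers HTTP submission, HTTP push, and a crude form of back-pressure:

```rust
use axum::{extract::State, http::StatusCode, routing::post, Router};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded channel: when it fills up, publishers get pushed back on
    // instead of the queue growing without limit.
    let (tx, mut rx) = mpsc::channel::<String>(1000);

    // Worker: deliver each queued event to a downstream service over HTTP.
    tokio::spawn(async move {
        let client = reqwest::Client::new();
        while let Some(event) = rx.recv().await {
            // Hypothetical subscriber endpoint; retries and dead-lettering
            // are left out of this sketch.
            if let Err(e) = client
                .post("http://localhost:9000/handle")
                .body(event)
                .send()
                .await
            {
                eprintln!("delivery failed: {e}");
            }
        }
    });

    // Publisher-facing side: accept events over HTTP.
    let app = Router::new().route("/events", post(publish)).with_state(tx);
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn publish(State(tx): State<mpsc::Sender<String>>, body: String) -> StatusCode {
    // try_send fails when the queue is full, turning overload into an
    // explicit 429 for the publisher: crude but effective back-pressure.
    match tx.try_send(body) {
        Ok(()) => StatusCode::ACCEPTED,
        Err(_) => StatusCode::TOO_MANY_REQUESTS,
    }
}
```

When the channel fills up, try_send fails and the publisher sees a 429, which is exactly the back-pressure behavior from my wish list.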

And the real goal here is to end up with a system simple enough that it mirrors what I'd recommend for production use cases with cloud services (and not my homegrown thing).


Prefetching Docker Images

While running Nomad I've been running into a bootstrapping/critical-path problem. I have a Docker Registry running in the cluster, and pulling an image requires:

  • The Registry to serve the image
  • Traefik to route the requests to the Registry, as well as to request Let's Encrypt certificates
  • GoCast to announce the floating IP for Traefik
  • Minio to store the images for the Registry

Problems updating images

Separate from bootstrapping, updating the image for many of these services requires everything to already be running, just to pull the next image. There is an open bug to address this in Nomad, but it doesn't seem like it will be resolved anytime soon.

When updating Traefik I run into a condition where GoCast has created the floating IP addr on the host but Traefik isn't running. The floating IP won't work while Traefik is running-but-not-serving. GoCast BGP is working correctly in that the floating IP is not announced to the network, but the updating host still can't reach the other hosts' instances of the floating IP. I'm not sure if leaving the addr in place is a feature or a bug.

A way around this would be to run multiple instances of Traefik on each host. As currently set up, though, I'd need to bind multiple instances of Traefik to the same ports, and SO_REUSEPORT isn't supported. With GoCast I could map the floating IP ports to container ports and not require host networking (thus avoiding the port collision), but that may be quite burdensome to manage. I also haven't tried running multiple ports with GoCast NAT'ing.

Solving part of the problem

For the Traefik case of not being able to pull the image there are some workarounds. Pulling manually, or using system batch jobs, could solve this, but both are fairly manual.

regclient has a daemon mode that can pull/sync images to registries, but it doesn't support pushing to a Docker Engine.

Docker Prefetch Image

I've started on a tool to prefetch Docker images based on a config file. It uses the Docker Engine API via the Rust docker_api crate to pull images to the host. Keeping the config file up to date with the images used in Nomad jobs is still a problem.
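
The pull itself is a small loop. Here's a rough sketch, assuming the docker_api crate's PullOpts builder and a local Docker socket; the image list here is a placeholder, since the real tool reads it from the config file:

```rust
use docker_api::{opts::PullOpts, Docker};
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Talk to the local Docker Engine over its Unix socket.
    let docker = Docker::new("unix:///var/run/docker.sock")?;

    // Placeholder list; the real tool reads image/tag pairs from config.
    let wanted = [("registry.example.com/traefik", "v2.10")];

    for (image, tag) in wanted {
        let opts = PullOpts::builder().image(image).tag(tag).build();
        let images = docker.images();
        let mut stream = images.pull(&opts);
        // The Engine streams progress chunks until the pull finishes or fails.
        while let Some(chunk) = stream.next().await {
            match chunk {
                Ok(status) => println!("{status:?}"),
                Err(e) => eprintln!("pull of {image}:{tag} failed: {e}"),
            }
        }
    }
    Ok(())
}
```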

Nomad's Consul Template integration can populate the config file from Consul to avoid manual file updates, which isn't terrible. I'm not sure if there is a nice way to integrate with the Nomad API to watch what images might be needed and pull them in advance of any job using them.

This has solved my case for updating parts of the critical path of Docker image hosting. It doesn't fully solve the bootstrapping case, though, where none of the services are running yet. One idea is to extend the config/API calls to have the "expected" image tag Nomad would look for and a "source". If the "expected" image cannot be pulled, pull the "source" and tag it locally as the "expected" tag. This would allow prefetching all of the images required for bootstrapping the system!
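
As a sketch, the fallback could look something like this. The PullOpts usage matches the loop above, but the TagOpts call is my assumption about docker_api's image-tagging API, not something from the actual tool:

```rust
use docker_api::{
    opts::{PullOpts, TagOpts},
    Docker,
};
use futures_util::StreamExt;

/// One entry from the (hypothetical) prefetch config file.
struct Entry {
    expected: String,       // the tag Nomad jobs reference, served by the cluster Registry
    source: Option<String>, // a public image to fall back to while bootstrapping
}

/// Pull a single image reference, draining the progress stream.
async fn pull(docker: &Docker, image_ref: &str) -> Result<(), docker_api::Error> {
    let (image, tag) = image_ref.rsplit_once(':').unwrap_or((image_ref, "latest"));
    let opts = PullOpts::builder().image(image).tag(tag).build();
    let images = docker.images();
    let mut stream = images.pull(&opts);
    while let Some(chunk) = stream.next().await {
        chunk?; // fail the pull on the first error chunk
    }
    Ok(())
}

async fn prefetch(docker: &Docker, entry: &Entry) -> Result<(), docker_api::Error> {
    // Happy path: the cluster Registry is up and serves the expected image.
    if pull(docker, &entry.expected).await.is_ok() {
        return Ok(());
    }
    // Bootstrapping path: pull the public source and re-tag it locally so
    // Nomad finds the expected tag even though the Registry is down.
    if let Some(source) = &entry.source {
        pull(docker, source).await?;
        let (repo, tag) = entry
            .expected
            .rsplit_once(':')
            .unwrap_or((entry.expected.as_str(), "latest"));
        docker
            .images()
            .get(source.as_str())
            .tag(&TagOpts::builder().repo(repo).tag(tag).build())
            .await?;
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), docker_api::Error> {
    let docker = Docker::new("unix:///var/run/docker.sock")?;
    // Placeholder names for illustration only.
    let entry = Entry {
        expected: "registry.example.com/traefik:v2.10".into(),
        source: Some("docker.io/library/traefik:v2.10".into()),
    };
    prefetch(&docker, &entry).await
}
```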


Nomad Events Logger

As part of learning Rust I built a tool to read events from the Nomad Events API and log them to stdout. This allows an easy, low-resource way to pull Nomad cluster events into your log processing stream.

Low-resource as in ~4MB of memory for the Docker container!
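
The core idea fits in a few lines. This is a minimal sketch of the loop rather than the tool's actual source, assuming reqwest with its stream feature enabled and a local Nomad agent at the default address:

```rust
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Long-lived streaming request to the local Nomad agent's event stream.
    let resp = reqwest::get("http://127.0.0.1:4646/v1/event/stream").await?;
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let bytes = chunk?;
        let text = String::from_utf8_lossy(&bytes);
        // Nomad sends empty `{}` frames as heartbeats; skip those and
        // write everything else to stdout for the log pipeline to pick up.
        if text.trim() != "{}" {
            print!("{text}");
        }
    }
    Ok(())
}
```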

Nomad Events Logger is deployable as a Docker image and, if I get around to it, a native binary as well. At the moment I run ~everything in Docker in my Nomad cluster, so let me know if you want other formats.


Paperless-ngx Celery won't consume documents

When running Paperless-ngx I ran into a problem where the Celery process in Docker (as part of supervisord) would start, supervisor would report it as running, but the Celery process appeared to do nothing.

The last related lines I would see were:

INFO success: celery entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[INFO] [paperless.management.consumer] Adding [REDACTED] to the task queue.

I'm not sure what part of Celery does this (maybe it's just Paperless?). But eventually I found a .__celery.lock file in the Paperless data directory. Removing that file allowed everything to work again.

This was likely caused by Nomad terminating the process and the lock file not getting cleaned up. I now have my Nomad job remove .*.lock files before starting Paperless.