Hello,
I’m trying to build a homelab whose primary job is running a bunch of daily batch processing jobs. I used to do this on AWS but it was getting really expensive, so I currently do it on a Ryzen 5600 setup (it was on sale). On my current on-premise setup, and even on AWS, I am only able to run a sample of the jobs I would like to run, and I essentially use this sample to approximate the rest (not ideal).
Each job currently takes approximately 2 minutes and about 2.4 Mb of memory. The job is simply a Python script that reads data from a PostgreSQL database and does a type of multivariate linear approximation. It’s similar to an ML algorithm but it’s not suited for GPU processing (a lot of small matrices vs. one big matrix). I need to run as many of these as possible on a daily basis. I would also like to run larger analyses on an ad hoc basis, but this type of study is not time constrained.
So other than running the script, the machine would need to efficiently run the PostgreSQL database where the data is stored. I currently have a Fractal Node 804 case (mATX) that I’d like to re-use if possible. My only constraint is budget: I don’t want to spend more than a couple thousand Canadian dollars, though I’ll entertain more expensive options if they make sense. I am also in Canada.
Before I bite the bullet and buy a Threadripper, any ideas?
Thanks
Python itself is way too slow for any serious compute tasks, so I’m going to assume you’re already using NumPy to speed it up. Even so, I’d look into rewriting it in a compiled language if feasible. There are CPU SIMD instructions for accelerating certain types of matrix operations. Supposedly NumPy should be able to make use of them, but I don’t have any experience with that. Your 5600, for example, supports 256-bit AVX2 operations.
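If the script currently loops over the small matrices one at a time in Python, one thing to try before a rewrite is stacking them into a single 3-D array and letting NumPy solve them all in one call. Here is a rough sketch. I’m assuming each job boils down to an ordinary least-squares fit solved through the normal equations, which may not match your actual method; the function name and array shapes are made up for illustration.

```python
import numpy as np

def batched_least_squares(X, y):
    """Solve many small least-squares problems in one vectorized call.

    X: (n_problems, n_obs, n_features) stacked design matrices
    y: (n_problems, n_obs) stacked targets
    Returns coefficients of shape (n_problems, n_features).

    Assumption: every problem has the same shape and each X'X is
    well conditioned, so the normal equations are safe to use.
    """
    Xt = np.swapaxes(X, 1, 2)        # (n, k, m): transpose each small matrix
    XtX = Xt @ X                     # (n, k, k)
    Xty = Xt @ y[..., None]          # (n, k, 1)
    return np.linalg.solve(XtX, Xty)[..., 0]

# Synthetic data only: 1000 problems, 50 observations, 4 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50, 4))
true_beta = rng.normal(size=(1000, 4))
y = np.einsum("nij,nj->ni", X, true_beta)   # noiseless targets

beta = batched_least_squares(X, y)
```

The point is that the per-matrix work happens inside compiled code instead of the Python interpreter. Whether that helps depends on how much of your 2 minutes is compute versus waiting on the database.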
If you have already ruled out code/runtime optimizations and concluded that the only way forward is more cores, the cheapest upgrade would be a 5900/5950X for 12 or 16 cores.
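For a rough idea of what more cores buys you, here is a back-of-the-envelope estimate. It assumes one job per core, the 2 minutes per job from your post, perfect scaling, and that the database keeps up. Those are all optimistic assumptions, so treat the numbers as an upper bound.

```python
def jobs_per_day(cores, minutes_per_job=2.0, hours_available=24.0):
    """Upper bound on daily jobs, assuming one job per core and perfect scaling."""
    jobs_per_core = (hours_available * 60.0) / minutes_per_job
    return int(cores * jobs_per_core)

# Core counts: the 5600 has 6; the 5900/5950X have 12 and 16.
for name, cores in [("5600", 6), ("5900X", 12), ("5950X", 16)]:
    print(name, jobs_per_day(cores))
# 5600 4320
# 5900X 8640
# 5950X 11520
```

If the number of jobs you actually want to run is well above that, a CPU swap alone won’t get you there.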
To start, move the database to a different machine that has a fast SSD and lots of memory. If your workload is mostly reads from the DB, consider splitting it into a single writer with one or more read replicas.
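On the script side, a writer/replica split mostly means picking the right connection string per query. A minimal sketch of that idea is below. The class name and host names are hypothetical, and it only hands out connection strings; you would pass the result to whatever PostgreSQL driver you already use.

```python
import itertools

class DsnRouter:
    """Hand out connection strings: writes go to the primary,
    reads rotate round-robin across the replicas."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        # Fall back to the primary if no replicas are configured.
        self._replicas = itertools.cycle(replica_dsns or [primary_dsn])

    def for_write(self):
        return self.primary_dsn

    def for_read(self):
        return next(self._replicas)

# Hypothetical hosts, for illustration only.
router = DsnRouter(
    "host=db-primary.example dbname=jobs",
    ["host=db-replica-1.example dbname=jobs",
     "host=db-replica-2.example dbname=jobs"],
)
```

Each batch job would call router.for_read() when it loads its input, so the read load spreads across the replicas while the primary only handles writes.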