Totally fair point — at the end of the day, it's all about getting the best model performance. I was mostly trying to highlight how, under the hood, a lot of modern HPO algos really boil down to smart scheduling decisions.
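For instance, successive halving (the scheduling core behind Hyperband/ASHA-style methods) fits in a few lines. This is only a minimal sketch: `train_for(config, budget)` is a hypothetical stand-in for whatever partial-train-and-evaluate step you actually have.

```python
import random

def successive_halving(configs, train_for, min_budget=1, eta=3):
    """Give every config a small budget, keep the top 1/eta performers,
    and multiply the budget by eta each round -- HPO as pure scheduling."""
    budget = min_budget
    while len(configs) > 1:
        scored = [(train_for(cfg, budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # higher score = better
        configs = [cfg for _, cfg in scored[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# stand-in objective just to make the sketch runnable (budget is ignored here)
best = successive_halving(
    configs=[{"lr": 10 ** random.uniform(-4, -1)} for _ in range(27)],
    train_for=lambda cfg, budget: -abs(cfg["lr"] - 0.01),
)
print(best)
```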
Pickle is still good for custom objects (JSON loses methods and also ordering), graphs and circular refs (JSON breaks on those), and functions and lambdas (essential for ML and distributed systems), and it's provided out of the box.
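A quick illustration of the circular-reference point (plain JSON refuses the structure, pickle round-trips it):

```python
import json
import pickle

a = {"name": "a"}
a["self"] = a                      # circular reference

try:
    json.dumps(a)                  # json refuses circular structures
except ValueError as e:
    print(e)                       # "Circular reference detected"

blob = pickle.dumps(a)             # pickle tracks shared references in its memo
restored = pickle.loads(blob)
assert restored["self"] is restored
```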
We're contemplating protocols that don't evaluate or run code; that rules out serializing functions or lambdas (i.e., code).
Custom objects in Python don't have "order" unless they're using `__slots__` - in which case the application already knows what they are from its own class definition. Similarly, methods don't need to be serialized.
A general graph is isomorphic to a sequence of nodes plus a sequence of edge definitions. You only need your own lightweight protocol on top.
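A minimal sketch of that kind of lightweight protocol over plain JSON; the `nodes`/`edges` key names are just illustrative, not any standard:

```python
import json

def graph_to_json(adjacency):
    """adjacency: {node_id: [neighbor_id, ...]} -- cycles are fine,
    because we only ever write ids, never nested objects."""
    nodes = sorted(adjacency)
    edges = [[src, dst] for src in nodes for dst in adjacency[src]]
    return json.dumps({"nodes": nodes, "edges": edges})

def graph_from_json(text):
    doc = json.loads(text)
    adjacency = {node: [] for node in doc["nodes"]}
    for src, dst in doc["edges"]:
        adjacency[src].append(dst)
    return adjacency

cyclic = {"a": ["b"], "b": ["a"]}   # a cycle JSON can't represent by nesting
assert graph_from_json(graph_to_json(cyclic)) == cyclic
```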
Because globals(), locals(), classes, and class instances are backed by dicts, and dicts are insertion-ordered in CPython since 3.6 (and in the language spec since 3.7), object attributes are effectively ordered in Python.
Object instances with __slots__ do not have a dict of attributes.
__slots__ attributes of Python classes are ordered, too.
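Both points are easy to check directly:

```python
class Point:
    __slots__ = ("x", "y")          # declaration order is preserved
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
assert Point.__slots__ == ("x", "y")     # ordered, straight from the class
assert not hasattr(p, "__dict__")        # no per-instance attribute dict

class Plain:
    def __init__(self):
        self.first = 1
        self.second = 2

assert list(vars(Plain()).keys()) == ["first", "second"]  # insertion-ordered __dict__
```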
Are graphs isomorphic if their nodes and edges are in a different sequence?
assert dict(a=1, b=2) == dict(b=2, a=1)    # plain dict equality ignores insertion order
from collections import OrderedDict as odict
assert odict(a=1, b=2) != odict(b=2, a=1)  # OrderedDict equality is order-sensitive
To cryptographically sign RDF in any format (XML, JSON, JSON-LD, RDFa), a canonicalization algorithm is applied to normalize the input data prior to hashing and cryptographically signing. Like Merkle hashes of tree branches, a cryptographic signature of a normalized graph is a substitute for more complete tests of isomorphism.
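A simplified, non-RDF sketch of the same idea: canonicalize the node and edge sequences before hashing, so two different orderings of the same graph produce the same digest. (Real RDF canonicalization, e.g. URDNA2015, also has to rename blank nodes, which this skips; it works here only because the node labels are stable.)

```python
import hashlib
import json

def canonical_graph_hash(nodes, edges):
    """Sort nodes and edges into a canonical order, then hash the result.
    Equal digests then stand in for a (weaker) isomorphism check."""
    canonical = {"nodes": sorted(nodes),
                 "edges": sorted(map(tuple, edges))}
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

g1 = canonical_graph_hash(["a", "b", "c"], [("a", "b"), ("b", "c")])
g2 = canonical_graph_hash(["c", "b", "a"], [("b", "c"), ("a", "b")])
assert g1 == g2     # same graph, different input order, same digest to sign
```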
Also, pickle stores the class name to unpickle data into as a (variously dotted) string. If the object class's version is not part of that class name, pickle will unpickle data from appA.Pickleable into appB.Pickleable (or PickleableV1 into PickleableV2 objects, as long as `PickleableV2 = PickleableV1` is specified, or the name is remapped, in the deserializing code).
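The remapping can be done with a custom `Unpickler`; `appA.models`/`appB.models` below are hypothetical module paths standing in for the two applications:

```python
import io
import pickle

class RenamingUnpickler(pickle.Unpickler):
    # map the module/class name stored in the pickle to the local class
    RENAMES = {("appA.models", "Pickleable"): ("appB.models", "Pickleable")}

    def find_class(self, module, name):
        module, name = self.RENAMES.get((module, name), (module, name))
        return super().find_class(module, name)

def load_remapped(data: bytes):
    # assumes appB.models is importable in the loading application
    return RenamingUnpickler(io.BytesIO(data)).load()
```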
So do methods need to be pickled? No, for security. Yes, because otherwise the data unpickled in appB is not isomorphic with the appA.Pickleable class instances that were pickled.
One solution: add a version attribute to each object, store it with every object, and discard it before testing equality on the other attributes.
Another solution: include the source object's version in the class name that gets stored with every pickled object instance, and try hard to make sure the destination class is the same.
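A sketch of the first approach (the attribute and class names are just illustrative):

```python
import pickle

class Record:
    SCHEMA_VERSION = 2                      # bumped when the class layout changes

    def __init__(self, payload):
        self.payload = payload
        self.version = self.SCHEMA_VERSION  # stored with every pickled instance

    def __eq__(self, other):
        # compare by data only; the version travels with the object
        # but is discarded for equality checks
        return isinstance(other, Record) and self.payload == other.payload

old = pickle.loads(pickle.dumps(Record("hello")))
assert old == Record("hello")
assert old.version == Record.SCHEMA_VERSION
```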
joblib is not fully secure because it still relies on pickle internally. The reason it is considered slightly better than pickle is that joblib doesn't execute code just by being imported, but the moment you call joblib.load() on an untrusted file, the underlying pickle payload still runs.
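To make the risk concrete: anything that defines a `__reduce__` hook gets its callable invoked during deserialization, whether you go through `pickle.load` or `joblib.load`. A harmless demonstration:

```python
import pickle

class Payload:
    def __reduce__(self):
        # a real attack would return something like (os.system, ("...",));
        # print stands in to keep the demo harmless
        return (print, ("code ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)        # prints the message -- no method call needed
```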
PyTorch save/load is still pickle-based. It's fine for trusted sources, but once you start loading models from untrusted sources there is always a risk of arbitrary code execution (ACE).
If you want to load it anyway, I would suggest trying it in a sandboxed environment like Docker, a VM, or an online notebook environment; another option is to inspect the model file first.
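Two concrete options along those lines: recent PyTorch versions accept `weights_only=True` in `torch.load` (restricting unpickling to tensors and plain containers), and `pickletools` can disassemble the pickle stream without executing anything. A sketch, assuming a model file at the hypothetical path `model.pt`:

```python
import pickletools
import zipfile

import torch

# 1) Safer load: refuse arbitrary objects, only tensors/containers
#    (supported in recent PyTorch releases)
state = torch.load("model.pt", map_location="cpu", weights_only=True)

# 2) Inspect without executing: a .pt file is a zip archive whose
#    data.pkl member holds the pickle stream -- disassemble it and
#    look for suspicious GLOBAL/REDUCE opcodes before trusting it
with zipfile.ZipFile("model.pt") as zf:
    pkl_name = next(n for n in zf.namelist() if n.endswith("data.pkl"))
    pickletools.dis(zf.read(pkl_name))
```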
As open-source AI booms, the risk of supply-chain attacks also increases.
Our approach wasn't about over-engineering; we were trying to leverage our existing investments (like Confluent BYOC) while optimizing for flexibility, cost, and performance. We wanted to stay loosely coupled so we could adapt to cloud restrictions across multiple geographic deployments.
We did have a discussion on self-managed vs. managed and the TCO associated with each.
1> We have a multi-regional setup, so data sovereignty requirements came into play.
2> Vendor lock-in - a few of the managed services were not available in that geographic region.
3> With managed services, you often pay for capacity you might not always use. Our workloads were mostly consistent and predictable, so self-managed solutions helped us fine-tune our resources.
4> One of the goals was to keep our storage and compute loosely coupled while staying Iceberg-compatible for flexibility. Whether it's Trino today or Snowflake/Databricks tomorrow, we aren't locked in.
As for BigQuery, while it's a great tool, we faced challenges with high volumes of small queries where costs became unpredictable, since it's priced by the volume of data scanned. Clustered tables and materialized views helped to some extent, but they didn't fully mitigate the overhead for our specific workloads. There are certainly ways to optimize around this, so I wouldn't pin it on BigQuery itself or call it a hard limitation.
It's always a trade-off, and we made the call that best fit our scale, workloads, and long-term plans.
I am not sure whether managing a Kafka Connect cluster is too expensive in the long term. This solution might work for you based on your needs, but I would suggest looking at alternatives as well.
IMO, I don't think Terraform is the right tool for containerized services. I had experimented with Terraform and Ansible for deployments earlier, but I found simpler deployments using Serverless or Apex.