Totally fair point — at the end of the day, it's all about getting the best model performance. I was mostly trying to highlight how, under the hood, a lot of modern HPO algos really boil down to smart scheduling decisions.
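For instance, successive halving (the scheduling core behind Hyperband/ASHA-style methods) fits in a few lines. This is only a minimal sketch: `train_for(config, budget)` is a hypothetical stand-in for whatever partial-train-and-evaluate step you actually have.

```python
import random

def successive_halving(configs, train_for, min_budget=1, eta=3):
    """Give every config a small budget, keep the top 1/eta performers,
    and multiply the budget by eta each round -- HPO as pure scheduling."""
    budget = min_budget
    while len(configs) > 1:
        scored = [(train_for(cfg, budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # higher score = better
        configs = [cfg for _, cfg in scored[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# stand-in objective just to make the sketch runnable (budget is ignored here)
best = successive_halving(
    configs=[{"lr": 10 ** random.uniform(-4, -1)} for _ in range(27)],
    train_for=lambda cfg, budget: -abs(cfg["lr"] - 0.01),
)
print(best)
```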
Pickle is still good for custom objects (JSON loses methods and also ordering), graphs and circular refs (JSON breaks on those), and functions and lambdas (essential for ML and distributed systems), and it's provided out of the box.
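A quick illustration of the circular-reference point (plain JSON refuses the structure, pickle round-trips it):

```python
import json
import pickle

a = {"name": "a"}
a["self"] = a                      # circular reference

try:
    json.dumps(a)                  # json refuses circular structures
except ValueError as e:
    print(e)                       # "Circular reference detected"

blob = pickle.dumps(a)             # pickle tracks shared references in its memo
restored = pickle.loads(blob)
assert restored["self"] is restored
```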
We're contemplating protocols that don't evaluate or run code; that rules out serializing functions or lambdas (i.e., code).
Custom objects in Python don't have "order" unless they're using `__slots__` - in which case the application already knows what they are from its own class definition. Similarly, methods don't need to be serialized.
A general graph is isomorphic to a sequence of nodes plus a sequence of edge definitions. You only need your own lightweight protocol on top.
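A minimal sketch of that kind of lightweight protocol over plain JSON; the `nodes`/`edges` key names are just illustrative, not any standard:

```python
import json

def graph_to_json(adjacency):
    """adjacency: {node_id: [neighbor_id, ...]} -- cycles are fine,
    because we only ever write ids, never nested objects."""
    nodes = sorted(adjacency)
    edges = [[src, dst] for src in nodes for dst in adjacency[src]]
    return json.dumps({"nodes": nodes, "edges": edges})

def graph_from_json(text):
    doc = json.loads(text)
    adjacency = {node: [] for node in doc["nodes"]}
    for src, dst in doc["edges"]:
        adjacency[src].append(dst)
    return adjacency

cyclic = {"a": ["b"], "b": ["a"]}   # a cycle JSON can't represent by nesting
assert graph_from_json(graph_to_json(cyclic)) == cyclic
```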
Because globals(), locals(), classes, and class instances are backed by dicts, and dicts are insertion-ordered in CPython since 3.6 (and in the language spec since 3.7), object attributes are effectively ordered in Python.
Object instances with __slots__ do not have a dict of attributes.
__slots__ attributes of Python classes are ordered, too.
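Both points are easy to check directly:

```python
class Point:
    __slots__ = ("x", "y")          # declaration order is preserved
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
assert Point.__slots__ == ("x", "y")     # ordered, straight from the class
assert not hasattr(p, "__dict__")        # no per-instance attribute dict

class Plain:
    def __init__(self):
        self.first = 1
        self.second = 2

assert list(vars(Plain()).keys()) == ["first", "second"]  # insertion-ordered __dict__
```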
Are graphs isomorphic if their nodes and edges are in a different sequence?
assert dict(a=1, b=2) == dict(b=2, a=1)    # plain dict equality ignores insertion order
from collections import OrderedDict as odict
assert odict(a=1, b=2) != odict(b=2, a=1)  # OrderedDict equality is order-sensitive
To cryptographically sign RDF in any format (XML, JSON, JSON-LD, RDFa), a canonicalization algorithm is applied to normalize the input data prior to hashing and cryptographically signing. Like Merkle hashes of tree branches, a cryptographic signature of a normalized graph is a substitute for more complete tests of isomorphism.
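A simplified, non-RDF sketch of the same idea: canonicalize the node and edge sequences before hashing, so two different orderings of the same graph produce the same digest. (Real RDF canonicalization, e.g. URDNA2015, also has to rename blank nodes, which this skips; it works here only because the node labels are stable.)

```python
import hashlib
import json

def canonical_graph_hash(nodes, edges):
    """Sort nodes and edges into a canonical order, then hash the result.
    Equal digests then stand in for a (weaker) isomorphism check."""
    canonical = {"nodes": sorted(nodes),
                 "edges": sorted(map(tuple, edges))}
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

g1 = canonical_graph_hash(["a", "b", "c"], [("a", "b"), ("b", "c")])
g2 = canonical_graph_hash(["c", "b", "a"], [("b", "c"), ("a", "b")])
assert g1 == g2     # same graph, different input order, same digest to sign
```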
Also, pickle stores the class name to unpickle data into as a (variously dotted) string. If the object class's version is not part of that class name, pickle will unpickle data from appA.Pickleable into appB.Pickleable (or PickleableV1 into PickleableV2 objects, as long as `PickleableV2 = PickleableV1` is specified, or the name is remapped, in the deserializing code).
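The remapping can be done with a custom `Unpickler`; `appA.models`/`appB.models` below are hypothetical module paths standing in for the two applications:

```python
import io
import pickle

class RenamingUnpickler(pickle.Unpickler):
    # map the module/class name stored in the pickle to the local class
    RENAMES = {("appA.models", "Pickleable"): ("appB.models", "Pickleable")}

    def find_class(self, module, name):
        module, name = self.RENAMES.get((module, name), (module, name))
        return super().find_class(module, name)

def load_remapped(data: bytes):
    # assumes appB.models is importable in the loading application
    return RenamingUnpickler(io.BytesIO(data)).load()
```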
So do methods need to be pickled? No, for security. Yes, because otherwise the data unpickled in appB is not isomorphic with the appA.Pickleable class instances that were pickled.
One solution: add a version attribute to each object, store it with every object, and discard it before testing equality on the other attributes.
Another solution: include the source object's version in the class name that gets stored with every pickled object instance, and try hard to make sure the destination class is the same.
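A sketch of the first approach (the attribute and class names are just illustrative):

```python
import pickle

class Record:
    SCHEMA_VERSION = 2                      # bumped when the class layout changes

    def __init__(self, payload):
        self.payload = payload
        self.version = self.SCHEMA_VERSION  # stored with every pickled instance

    def __eq__(self, other):
        # compare by data only; the version travels with the object
        # but is discarded for equality checks
        return isinstance(other, Record) and self.payload == other.payload

old = pickle.loads(pickle.dumps(Record("hello")))
assert old == Record("hello")
assert old.version == Record.SCHEMA_VERSION
```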
joblib is not fully secure because it still relies on pickle internally. The reason it is considered slightly better than pickle is that joblib doesn't execute code just by being imported, but the moment you call joblib.load() on an untrusted file, the underlying pickle payload still runs.
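To make the risk concrete: anything that defines a `__reduce__` hook gets its callable invoked during deserialization, whether you go through `pickle.load` or `joblib.load`. A harmless demonstration:

```python
import pickle

class Payload:
    def __reduce__(self):
        # a real attack would return something like (os.system, ("...",));
        # print stands in to keep the demo harmless
        return (print, ("code ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)        # prints the message -- no method call needed
```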
PyTorch save/load is still pickle-based. It's fine for trusted sources, but once you start loading models from untrusted sources there is always a risk of arbitrary code execution (ACE).
If you want to load it anyway, I would suggest trying it in a sandboxed environment like Docker, a VM, or an online notebook environment; another option is to inspect the model file first.
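Two concrete options along those lines: recent PyTorch versions accept `weights_only=True` in `torch.load` (restricting unpickling to tensors and plain containers), and `pickletools` can disassemble the pickle stream without executing anything. A sketch, assuming a model file at the hypothetical path `model.pt`:

```python
import pickletools
import zipfile

import torch

# 1) Safer load: refuse arbitrary objects, only tensors/containers
#    (supported in recent PyTorch releases)
state = torch.load("model.pt", map_location="cpu", weights_only=True)

# 2) Inspect without executing: a .pt file is a zip archive whose
#    data.pkl member holds the pickle stream -- disassemble it and
#    look for suspicious GLOBAL/REDUCE opcodes before trusting it
with zipfile.ZipFile("model.pt") as zf:
    pkl_name = next(n for n in zf.namelist() if n.endswith("data.pkl"))
    pickletools.dis(zf.read(pkl_name))
```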
As open-source AI booms, the risk of supply-chain attacks also increases.
Our approach wasn't about over-engineering; we were trying to leverage our existing investments (like Confluent BYOC) while optimizing for flexibility, cost, and performance. We wanted to stay loosely coupled so we could adapt to cloud restrictions across multiple geographic deployments.
We did have a discussion on self-managed vs. managed and the TCO associated with each.
1> We have a multi-regional setup, so data sovereignty requirements came into play.
2> Vendor lock-in - a few of the managed services were not available in that geographic region.
3> With managed services, you often pay for capacity you might not always use. Our workloads were mostly consistent and predictable, so self-managed solutions helped us fine-tune our resources.
4> One of the goals was to keep our storage and compute loosely coupled while staying Iceberg-compatible for flexibility. Whether it's Trino today or Snowflake/Databricks tomorrow, we aren't locked in.
As for BigQuery, while it's a great tool, we faced challenges with high volumes of small queries where costs became unpredictable, since it's priced by the volume of data scanned. Clustered tables and materialized views helped to some extent, but they didn't fully mitigate the overhead for our specific workloads. There are certainly ways to optimize around this, so I wouldn't pin it on BigQuery itself or call it a hard limitation.
It's always a trade-off, and we made the call that best fit our scale, workloads, and long-term plans.
I am not sure whether managing a Kafka Connect cluster is too expensive in the long term. This solution might work for you based on your needs, but I would suggest looking at alternatives as well.
IMO, I don't think Terraform is the right tool for containerized services. I had experimented with Terraform and Ansible for deployments earlier, but I found simpler deployments using Serverless or Apex.