Hacker News | ankrgyl's comments

Great to see popular SaaS use cases commoditized and opened up like this. After product categories start to fossilize into a standard UX + features, the benefits of using an expensive vendor start to decrease.


exactly. docsend is a great product, but once my free trial ended i didn’t want to pay $20/mo just to send documents. maybe i’m cheap ¯\_(ツ)_/¯


Will you ever consider using EBS volumes or another mechanism which supports random writes?


Why use Kubernetes if you’re in control of the cloud environment? What does it bring to the table? Why not firecracker?


In a lot of ways, basing this on k8s provides a lot of flexibility and independence, and with k8s there's much less friction to providing compute with high locality relative to users' application code.

It's also the case that by staying with k8s we can take advantage of existing operational tooling, experience, and work, and can focus our development time on the important parts of this problem: runtime scaling, scheduling, and virtual machine management, not cloud provider APIs and management.

In short, k8s gives us options that we like for the future, it's shortening the development cycle, and only getting in our way a below-average amount. At the same time--for the most part--we're building this with reasonable abstractions that would let us reuse our existing work if k8s becomes more trouble than it's worth.


Firecracker doesn’t support live migration. There is a newer project called Cloud Hypervisor that showed a lot of promise, but we struggled to make it work and reverted to QEMU.

As for k8s, it’s an ongoing debate internally whether the complexity is worth the benefit. It helps us provision nodes, but we have to fight it quite a bit too. It’s unclear whether we’ll keep it long term.


I love DuckDB and am cheering for MotherDuck, but I think bragging about how fast you can query small data is really no different than bragging about big data. In reality, big data's success is not about data volume. It's about enabling people to effectively collaborate on data and share a single source of truth.

I don't know much about MotherDuck's plans, but I hope they're focused on making it as easy to collaborate on "small data" as Snowflake/etc. have made it to collaborate on "big data".


It’d be great to show the debugging experience in the video (in fact, I’d prefer seeing that over the breadth of features). E.g. what happens when there’s a syntax error in my sql query or the python code fails on an invalid input?

That tends to be the critical make it or break it feature when you’re writing code in an app builder.


Agree, debugging is a critical user experience! In Patterns, you'll see the full stack trace and all logs when you execute Python or SQL.


The Rust vs. Go comparison has two key differences:

- The Rust example uses 8-bit unsigned ints, while the Go example uses 32-bit signed ints.

- Rust's sort is stable by default whereas Go's is not.

If you tweak the Rust benchmark to use `i32` instead of `u8` and `sort_unstable` instead of `sort`, you should see ~3-4x faster performance.
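For illustration, here's a minimal Rust sketch of both tweaks (the array size and contents are arbitrary, not the benchmark's):

```rust
fn main() {
    // Use i32 to match the Go benchmark (the original Rust used u8).
    let mut v: Vec<i32> = (0..1_000_000).rev().collect();

    // `sort` is stable: it preserves the order of equal elements and
    // allocates a temporary buffer. `sort_unstable` sorts in place
    // (Go's sort.Slice is also unstable), which is usually faster.
    v.sort_unstable();

    assert!(v.windows(2).all(|w| w[0] <= w[1]));
    println!("min={} max={}", v[0], v[999_999]);
}
```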


Made a PR with the fixes; Rust is now 3x faster than TinyGo, and the wasm is almost 3x smaller (wasm+js is half the size), as expected.

https://github.com/Ecostack/wasm-rust-go-asc/pull/1

My first foray into wasm, so I probably missed some optimizations like wasm-opt.


Also, I would assume different languages have different random() implementations, which could contribute to the run time. So to make the tests equal, you should not measure the time it takes to set up the array.


The Go version should also use `sort.Ints`. https://pkg.go.dev/sort#Ints


There is a visual demo here: https://sites.google.com/view/llm4html/home.

This work is very exciting to me for a few reasons:

- HTML is an incredibly rich source of visually structured information, with a semi-structured representation. This is as opposed to PDFs, which are usually fed into models with a "flat" representation (words + bounding boxes). Intuitively, this offers the model a more direct way to learn about nested structure, over an almost unlimited source of unsupervised pre-training data.

- Many projects (e.g. Pix2Struct https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate on pixels, which are expensive (both to render and process in the transformer). Operating on HTML directly means smaller, faster, more efficient models.

- (If open sourced) it will be the first (AFAIK) open pre-trained ready-to-go model for the RPA/automation space (there are several closed projects). They claim they plan to open source the dataset at least, which is very exciting.

I'm particularly excited to extend this and similar (https://arxiv.org/abs/2110.08518) for HTML question answering and web scraping.

Disclaimer: I'm the CEO of Impira, which creates OSS (https://github.com/impira/docquery) and proprietary (http://impira.com/) tools for analyzing business documents. I am not affiliated with this project.


Exciting/scary stuff! A sophisticated enough version could carry out any of the tasks a typical computer user could in a browser, from just a few sentences, with a somewhat high chance of success.

We will overuse this tech, forgetting that for important processes it is perhaps wise to keep a "human backup" for redundancy. Then again, RPA is already a case where a "proper" rewrite of some multi-program pipeline is impossible.


This is a "classic" tension. Having worked in the (broader) RPA space for a while, I would say that the true north star of most processes is (a) rewriting the internal procedures to be transformations on data (not UIs) and (b) standardizing communication across companies.

There is a lot of momentum to solve (a) with no code, but it's slow because processes are impossibly complex. I think AI will accelerate this and could result in the "human backup" dystopia. On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would take 10 or 100x less work than no/low code.


> On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would take 10 or 100x less work than no/low code.

Ah right, lots of angles to consider! A hybrid system would certainly be interesting. Let the AI runtime generate and evaluate code to perform tasks (e.g. selenium/puppeteer in python/java). Upon failure, "escalate permissions" to enable DOM control, or full mouse/keyboard to complete the task (probably best not to let the thing open up a code-editor with M/KB controls though heh)


The model used by this research is T5, which is open sourced already. So I think once the dataset is released, we'll see an open version of the pre-trained model very soon.


This is Google; they for sure aren’t releasing the weights.


They release a lot of weights open source, including T5 (the underlying model they used in this work). They also indicated their intent here: https://twitter.com/aleksandrafaust/status/15799326368934420....



could also jump straight into the code that generates so much of that 'unlimited' html (them web frameworks)


DocQuery (https://github.com/impira/docquery), a project I work on, allows you to do something similar, but search over semantic information in the PDF files (using a large language model that is pre-trained to query business documents).

For example:

  $ docquery scan "What is the due date?" /my/invoices/
  /my/invoices/Order1.pdf       What is the due date?: 4/27/2022
  /my/invoices/Order2.pdf       What is the due date?: 9/26/2022
  ...
It's obviously a lot slower than "grepping", but very powerful.


Wow, this is exactly what I've been looking for, thank you! I just wish it were possible to extract a structured set of what these transformer models "know" (e.g. for easy search indexing). These natural language question systems are a little too fuzzy sometimes.


Can you tell me a bit more about your use case? A few things that come to mind:

- There are some ML/transformer-based methods for extracting a known schema (e.g. NER) or an unknown schema (e.g. relation extraction).

- We're going to add a feature to DocQuery called "templates" soon for some popular document types (e.g. invoices) + a document classifier which will automatically apply the template based on the doc type.

- Our commercial product (http://impira.com/) supports all of this + is a hosted solution (many of our customers use us to automate accounts payable, process insurance documents, etc.)


Your commercial product looks very cool, but my use case is in creating an offline-first local document storage system (data never reaches a cloud). I'd like to enable users to search through all documents for relevant pieces of information.

The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?


> but my use case is in creating an offline-first local document storage system (data never reaches a cloud).

Makes sense -- this is why we OSS'd DocQuery :)

> The templates sound very cool - are they essentially just using a preset list of (natural language) queries tied to a particular document class? It seems like you're using a version of donut for your document classification?

Yes that's the plan. We've done extensive testing with other approaches (e.g. NER) and realized that the benefits of using use-case specific queries (customizability, accuracy, flexibility for many use cases) outweigh the tradeoffs (NER only needs one execution for all fields).

Currently, we support pre-trained Donut models for both querying and classification. You can play with it by adding the --classify flag to `docquery scan`. We're releasing some new stuff soon that should be faster and more accurate.


Sweet! I'll keep an eye on the repo. Thank you for open sourcing DocQuery. I agree with your reasoning: my current attempts to find an NER model that covers all my use cases have come up short


Since you mention insurance documents, could you speak to how well this would extract data from a policy document like https://ahca.myflorida.com/medicaid/Prescribed_Drug/drug_cri... ?

The unstoppable administrative engine that is the American Healthcare system produces hundreds of thousands of continuously updated documents like this with no standardized format/structure.

Manually extracting/normalizing this data into a queryable format is an industry all its own.


It's very easy to try! Just plug that URL here: https://huggingface.co/spaces/impira/docquery.

I tried a few questions:

  What is the development date? -> June 20, 2017
  What is the medicine? -> SPINRAZA® (nusinersen)
  How many doses -> 5 doses
  Did the patient meet the review criteria? -> Patient met initial review criteria.
  Is the patient treated with Evrysdi? -> not


> to extract a structured set of what the model "knows"

To be fair, that's impossible in the general case, since the model can know things (ie be able to answer queries) without knowing that it knows them (ie being able to produce a list of answerable queries by any means significantly more efficient than trying every query and seeing which ones work).

As a reductio ad absurdum example, consider a 'model' consisting of a deniably encrypted key-value store, where it's outright cryptographically guaranteed that you can't efficiently enumerate the queries. Neural networks aren't quite that bad, but (in the general-over-NNs case) they at least superficially appear to be pretty close. (They're definitely not reliably secure though; don't depend on that.)


This is so epic. I was just ruminating about this particular use case. Who are your typical customers, supply chain or purchasing? Also, I notice that you do text extraction from invoices? Are you using something similar to CharGRID or its derivative BERTGRID? Wish you and your team more success!


Thank you ultrasounder! Supply chain, construction, purchasing, insurance, financial services, and healthcare are our biggest verticals. Although we have customers doing just about anything you can imagine with documents!

For invoices, we have a pre-trained model (demo here: https://huggingface.co/spaces/impira/invoices) that is pretty good at most fields, but within our product, it will automatically learn about your formats as you upload documents and confirm/correct predictions. The pre-trained model is based on LayoutLM and the additional learning we do uses a generative model (GMMs) that can learn from as little as one example.

LMK if you have any other questions.


Can this work like an AI readability extension? What's the content of this page? What are the navigation options on this page?


I’d like a way to index a lot of case law and then ask it questions.


Many of the comments seem to miss the point entirely -- this post is about how to achieve sufficient decentralization:

> two users can find each other and communicate, even if the rest of the network wants to prevent it

and use centralized methods where it does not compromise on this constraint (e.g. efficiently storing/retrieving data about posts).


If your goal is to be a better coder long term, I’d suggest learning both. As many have pointed out, Go is quick to learn, and I think learning both will only take a small amount of extra time (compared to learning Rust).

They each present trade-offs that make them a better tool under particular circumstances. While Rust exposes you to more sophisticated type system features, Go's M:N scheduler is an incredible piece of technology that is (IMO) unmatched by any other mainstream language.

Finally, regarding the garbage collector, if you learn both languages, you'll get to viscerally experience the tradeoffs of having the garbage collector (try writing the same program in each language). There are some projects where it makes sense to use one and others where you shouldn't. Trying out both is the best way to build up intuition for this kind of trade off.


> Go's M:N scheduler is an incredible piece of technology that is (IMO) unmatched by any other mainstream language.

Java’s Loom will change the picture big time, and it is already available in preview mode! But otherwise I agree.


> Go's M:N scheduler is an incredible piece of technology that is (IMO) unmatched by any other mainstream language.

I guess it depends on how you define mainstream, but in Rust, Rayon has a work-stealing scheduler too.

On the other hand, it has existed in Erlang for decades, and Elixir takes full advantage of it.

As for my 2c, I would say Rust is a better choice than Go as a first language to learn. The main reason for this is that it is easy to embed in other languages. When I get to a problem that is too slow in a higher-level language, e.g. Python or Elixir, Rust is a great way to solve it. Just have a look at polars (https://www.pola.rs/).


Afaik Go's scheduler is not unique. What does it do better than Haskell or Elixir?


It allows you to create many more language threads (goroutines) than kernel threads, and unlike other languages it properly hides this abstraction from you with a growable stack (so you don't need to create coroutines or use async/await syntax).

I'm not up to speed with the latest and greatest in Haskell or Elixir, so they very well may have something similar. Rust has great runtimes, like tokio, that do similar things, but without growable stacks and with painful async/await syntax.


An M:N scheduler was neither revolutionary nor rare even when Go was launched. Even a mainstream language like C# already had one (albeit based on continuations, until C# 5.0 came out). At the same time, some mainstream programming languages either had similar third-party M:N stackful schedulers when Go came out (gevent in Python) or got them a little while after (Quasar in Java).

Go's scheduler was only somewhat unique in the combination of features it pursued:

1. M:N

2. Stackful (i.e. unoptimized memory usage for each task/goroutine)

3. But using very small stacks[1] so it's easier to create a very large number of goroutines.

4. Integrated with the GC

5. Colorless functions

6. Built-in

7. No access to native threads

8. Not configurable or customizable.

9. Run as Native code, without using a virtual machine.

Colored functions (async/await, or Kotlin's suspend) are a matter of taste. They're heavily criticized for the burden they add, but advocates prefer the extra type safety they provide. If you want to be able to statically analyze for data races (or prevent them completely, as Rust does), I don't think you can avoid them.

Speaking of Rust, Rust did start with an M:N, work-stealing scheduler based on stackful coroutines. This scheduler was eventually removed[2] from the standard library, since it was deemed a bad match for a systems language.

Go was originally marketed as a systems language, but it was really a language that was optimized for writing concurrent servers by large teams of programmers with varying experience[3]. Specifically it was designed for server software at Google[4], and to replace the main place C++ was used for in Google: I/O-bound server software. That's why Go made very radical choices:

- Go native (we still need good performance, but not C++-level)

- Maximize concurrency

- Make concurrency easy

- Use GC (we need the language to be much easier than C++)

- Minimize GC pause times (we need reasonable performance at server workloads)

This meant that the Go M:N scheduler was usually the best-performing stackful scheduler for server loads for a while. Interpreted languages like Python, Node and Lua were slower and were either single-threaded or had a GIL. Erlang and Java used a VM and weren't optimized for low-latency GC. C and C++ had coroutine libraries, but since these languages are not garbage-collected, it was harder to optimize the coroutine stack size.

I think it created a wrong impression that the Go scheduler was revolutionary or best-in-class. It never was. The groundbreaking thing that Go did is to optimize the entire language for highly concurrent I/O-bound workloads. That, along with a sane (i.e. non-callback-based) asynchronous model and great PR from being a Google language helped Go popularize asynchronous programming.

But I wouldn't say it is unmatched by any mainstream language nowadays. Java has low-latency GC nowadays and is working on its own Go-like coroutine implementation (Project Loom). And all mainstream JVM languages (Kotlin, Scala and Clojure) already have their own M:N schedulers.

Rust, in the meantime, is strictly more performant than Go: it uses state-machine-based stackless coroutines, which emulate the way manual asynchronous implementations (like Nginx's) are done in C. You can't get more efficient than that. Not to mention that Rust doesn't have a GC and features more aggressive compiler optimizations.

[1] IIRC, the initial stack size has changed between 2k and 8k during the language's lifetime, and the stack resizing mechanism has changed as well.

[2] https://github.com/rust-lang/rfcs/blob/master/text/0230-remo...

[3] https://talks.golang.org/2012/splash.article

[4] This is why it didn't have a package manager for many years. Google was just using a monorepo, after all.


I've tried both and failed.

It really depends on the goal. If one wants to end up with an elaborate hello world, then probably yes, one can learn both.

Learning Rust to make it applicable to mid/large-scale projects, and potentially for a professional transition (my case), is a one-of-a-kind commitment, in my opinion, hardly compatible with learning another language at the same time.


In that case, I'd suggest learning Go first, as a kind of "step up" to learning Rust. If you haven't already, I'd also suggest learning C.

I think there's some linearity to these tasks. Rust (and C++) are an amalgamation of many complex features that are present in simpler predecessors. Learning C for example will not only speed up your ability to learn Rust, but also leave you with a deeper understanding of the core principles.


> If you haven't already, I'd also suggest learning C.

Learning a new language is not something to take lightly. Should I take 6 months in order to get more solid foundations before moving to Rust? 1 year? N years?

My biggest hurdle has been ownership ("borrow checking", which is the common problem) in the broad sense (which includes: learning to design a whole program considering ownership). This is something that C/++ programmers are surely familiar with, so it absolutely helps, but I don't think it's efficient to learn C/++ in order to step up to Rust.


The more languages you learn, especially ones with reusable concepts, the less of a tax it is to learn a new one. As someone who has written C/C++ for over a decade, I found learning Rust very straightforward: I just had to learn about the borrow checker, and it was fairly easy to reason about how it works based on my knowledge of unique pointers in C++.

I certainly don't think C++ is a prerequisite to learning Rust, but I do think your path to deeply understanding Rust will accelerate if you understand C (or specifically, understand pointers and heap allocations). But to each their own!


Learning new languages is certainly useful/important, but time is finite. The idea of learning a certain language as a bridge to another is unrealistic unless one has a _lot_ of time.


I come from Haskell and fought fiercely with the borrow checker for two months. We’re friends now.

I talked to a C programmer who said thinking in lifetimes (when to free()) was second nature, so didn’t even think heavily about it.


Just replied to another comment recommending C. Can you help me understand which core principles C would help to refine, i.e. what am I leaving on the table if I go with Rust or C++?


Nothing; these languages all expose the same underlying memory controls. But going with C may still be valuable, because Rust and C++ have much better abstractions, and Rust especially can easily hide the pointer-manipulation part from you, when that's presumably the very thing one learnt the language for.


Thanks, exactly what I was looking for!


> Just replied to another comment recommending C. Can you help me understand which core principles C would help to refine, i.e. what am I leaving on the table if I go with Rust or C++?

In the other thread you mention features. I suggest C because of its "simplicity", that is, its lack of features. It really is a thin abstraction on top of assembly. C's closeness to the hardware is edifying.

Moreover, since the goal is learning, History is quite relevant. Languages evolve in the context of their predecessors. You probably already know this for typescript since it is so close to JavaScript. But how do C++ and rust relate to C? And how does Go relate to C++ and C?

Languages always involve tradeoffs. For example, a language that checks memory bounds at runtime (Go) uses more CPU cycles than one that doesn't (C). A compiler that does checks is more complex than one that doesn't. Language features can help with safety (rust), but then the language takes more time to learn. Complexity can be hidden (Go garbage collection) to make up front learning easier, but this can make the language less flexible or more difficult to debug. Complexity can be exposed later but then the language learning curve is lengthened.

I'm not suggesting you go become a C pro, but learn enough to shoot yourself in the foot. It will help contextualize the features of these other languages and give you more mental tools for decision making in your own code.
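As a small illustration of the runtime bounds-checking tradeoff mentioned above: Rust, like Go, checks array indexing at runtime by default (the values here are arbitrary):

```rust
fn main() {
    let v = [10, 20, 30];

    // Plain indexing is bounds-checked at runtime: an out-of-range
    // index panics rather than reading arbitrary memory, as C would.
    println!("{}", v[1]);

    // `get` makes the check explicit in the types, returning None
    // instead of panicking when the index is out of range.
    assert_eq!(v.get(5), None);
    assert_eq!(v.get(2), Some(&30));
}
```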


Pointers :). C++ and Rust try to hide the concept away with references, which add a very helpful amount of type safety, but they do not shield you from the fundamental work of allocating memory and later freeing it.

It's not that you leave it on the table with Rust/C++ but more that you'll have a deeper understanding for what's going on if you understand how pointers work in C.


Thanks, this is what I was looking for! Pointers are one of those things I've heard about for years but don't know too much about. I know that in JS, objects are passed by copies of a reference to that object in memory. I have never felt that something was missing here, only that it requires an understanding of which types in JS are passed by value vs. reference.

What is the upside to pointers in C? For instance, in JS I can pass an object to a function that modifies the object's properties. The only thing I can't do (I think) is pass an object to a function and then reassign that object's reference to a new object literal, such that the original reference now points to the new object. Is this gap in a language like JS where pointers come in, or am I missing the forest for the trees? If this is where pointers become useful, can they really be that useful? Full disclaimer: I don't know anything about pointers aside from implementing data structures in JS, e.g. a linked list node has a `this.next` property, but in other languages this seems to be implemented as a pointer from what I've seen.

To sum my thoughts up: can you help me understand why pointers would be useful for me to invest in learning at the level of C, versus continuing to let them be abstracted away/not present in my language? What would that knowledge do for me? Will I be a better problem solver, etc.? Thanks again!


It's not that there's an "advantage" to pointers; rather, it's the reality that they exist. Everything you're doing in JS also "uses" pointers under the hood, but the language and runtime abstract that away from you. C exposes you to pointers directly, so you have an opportunity to learn what they are and how they work.

It's important to learn how they work, because it'll give you a better idea of what's actually happening in the higher level languages you're using. That may not be important if you stick to JS or Go, but if you'd like to learn Rust then it's impossible to accurately think through tradeoffs like whether to use Box, Rc, Arc, etc. without understanding how pointers work. In other words, I'd only recommend learning Rust if you take the time to understand pointers, and the best way to understand pointers is to write C.
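To make that concrete, here's a minimal Rust sketch (illustrative values only) of two of the pointer types mentioned above:

```rust
use std::rc::Rc;

fn main() {
    // Box<T> is a single owning pointer to a heap allocation: roughly
    // a malloc'd C pointer that is freed automatically when dropped.
    let boxed: Box<i32> = Box::new(41);
    println!("{}", *boxed + 1); // explicit dereference, as in C

    // Rc<T> is a reference-counted pointer: clone() bumps a count
    // instead of copying the data, much like sharing an object in JS.
    let shared = Rc::new(vec![1, 2, 3]);
    let alias = Rc::clone(&shared);
    assert_eq!(Rc::strong_count(&alias), 2); // two owners are alive
    assert_eq!(alias[2], 3);
}
```

Picking between these is exactly the kind of tradeoff that's hard to reason about without a mental model of what a pointer is.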

