> I can talk about concepts like "atoms" or "bacteria" or "black holes" with anyone, and they'll know what they are - even if their knowledge of those subjects isn't in depth.
I'm not convinced this is an unalloyed good. Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria is. See, for example, all of the people who want antibiotics whenever they're sick for any reason. We've just replaced one set of weird beliefs in the general populace with another and given it a veneer of science.
> Knowing that a disease is caused by "bacteria" instead of "demons" isn't really helpful if you don't have a deep understanding of exactly what bacteria is.
This is a poor example. Even an incomplete image of the germ theory of disease is a massive improvement over thinking illness is caused by demons. An extremely superficial understanding of bacteria as "microscopic organisms which can make you sick" gives good justification why people should do things like wash their hands, cover their mouth when coughing, and not lick the railing on a subway.
Knowing the difference between bacteria being living organisms and viruses being not-quite-alive does not qualify as a "deep understanding" though.
Further, the presence of people misunderstanding something that most of the population knows pretty well in no way makes teaching that subject to the population bad. Your assertion would require that believing demons cause sickness actually has benefits we've lost.
But more people today know, at a baseline level, what bacteria are and their role in disease than before, when all we had were demons/bad humors/etc.
There are functionally illiterate people too in modern day and the average reading level is still elementary school level, but that's vastly better than before when the average person couldn't read at all.
Suicide does not have stable reporting rates. It was very stigmatized in the past, and so investigators would notoriously report suicides as "unknown cause of death" if they could.
Violent crime, on the other hand, is much more correlated with things like poverty than with mental health.
I think it's quite obviously the case that there are no clear indicators of what "mental health" looked like 100 years ago and earlier. Any projection into the past will involve a lot of extrapolation and all sorts of biases.
They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task
Not really. The "Harry Potter Problem" as formulated is asking an LLM to solve a problem that they are architecturally unsuited for. They do poorly at counting and similar algorithms tasks no matter the size of the context provided. The correct approach to allowing an AI agent to solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge that it needs to write code to solve, then have it write the code and execute it.
Asking specific questions about your insurance policy is a qualitatively different type of problem that algorithms are bad at, but it's the kind of problem that LLMs are already very good at in smaller context windows. Making progress on that type of problem requires only extending a model's capabilities to use the context, not simultaneously building out a framework for solving algorithmic problems.
So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.
LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't, but you won't find out by asking it to count.
The WHO list of essential medicines is not just over-the-counter drugs. It includes things like the chemotherapy drug cisplatin. I happened to need that for testicular cancer ~10 years ago, and the treatment cost was $50k (as "paid" by insurance). That overall seems pretty reasonable to me for the treatment I received, but definitely not something I'd expect the median American to be able to pay out of pocket.
The median American would not have to pay out of pocket, as nearly every American has health insurance (since the ACA, it is actually illegal not to have insurance).
I think it's accurate to say that the median American is insured, with only 8% of the population uninsured [1]. Although, to put that percentage in perspective, that's 26 million people and likely thousands in excess mortality relative to the insured population.
I believe you're referring to the ACA's "individual mandate", which imposed a federal tax penalty for being uninsured. I won't argue whether that makes it illegal or not, but I can say that the individual mandate was eliminated by the Tax Cuts and Jobs Act as of 2019 [1]. There's no longer a federal tax penalty for being uninsured.
This is purely anecdotal, but of that 8% (26 million), I would posit that most of those people are uninsured by choice. e.g., probably mostly young, maybe part-time workers without chronic illnesses.
Your wording in this comment (and the Twitter/comment video) gives off the same vibes as the Google April 1st videos for things like Gmail Motion (https://www.youtube.com/playlist?list=PLAD8wFTLnQKeDsINWn8Wj...). I honestly thought this was full sarcasm at first.
I don't see how that would have helped in this case. This was not a resource at a known location that was supposed to be only available to logged in users. This was a resource that the admins didn't know about available at an unknown url that was exposed to the public internet due to a configuration error. Are you going to write a test case for every possible url in your server to make sure it's not being exposed?
Something that could work is including a random hash as a first hidden email inside of every client, and then regularly searching outbound traffic for that hash. But that would be rather expensive.
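The canary idea above could be sketched roughly like this. This is a minimal illustration, not a real product: how the tokens are stored per client and how outbound traffic is captured are left as assumptions, and the function names are hypothetical.

```python
# Sketch of the canary-email idea: seed each client with a unique random
# token (hidden in a first email), then scan captured outbound payloads
# for any known token. A hit means that client's data is leaking.
import secrets

def make_canary():
    # 128-bit random hex string, unique per client
    return secrets.token_hex(16)

def find_leaked_canaries(payload, known_canaries):
    """Return the canaries that appear in an outbound payload."""
    return [c for c in known_canaries if c in payload]
```

The expensive part, as noted, is the scanning: every outbound payload has to be checked against the full set of known canaries (or fed through something like an Aho-Corasick multi-pattern matcher to make that tractable).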
n=1, head of a security at a fintech. We perform automated scans of external facing sensitive routes and pages after deploys, checking for PII, PAN, and SPI indicators, kicked off by Github Actions. We also use a WAF with two person config change reviews (change management), which assists in preventing unexpected routes or parts of web properties being made public unexpectedly due to continuous integration and deployment practices (balancing dev velocity with security/compliance concerns).
Not within the resources of all orgs of course, but there is a lot of low hanging fruit through code alone that improves outcomes. Effective web security, data security, and data privacy are not trivial.
You don't need to check every one though. Or any. You create a known account with known content in it (similar to your hash idea) and monitor that.
Even if they never got around to automating it and were highly laissez-faire, manually checking that account with those testcases say once a month would have caught this within 30 days. That still sucks but it's at least an order of magnitude less suck than the situation they're in now.
If the screenshot in the article isn't edited, this was an HTTP service exposed to the internet on an unusual port (81). I'd propose the following test cases:
1) Are there any unexpected internet-facing services?
* Once per week (or per month, if there are thousands of internet-facing resources) use masscan or similar to quickly check for any open TCP ports on all internet-facing IPs/DNS names currently in use by the company.
* Check the list of open ports against a very short global allowlist of port numbers. In 2024, that list is probably just 80 and 443.
* Check each host/port combination against a per-host allowlist of more specific ports. e.g. the mail servers might allow 25, 465, 587, and 993.
* If a host/port combination doesn't match either allowlist, alert a human.
Edit: one could probably also implement this as a check when infrastructure is deployed, e.g. "if this container image/pod definition/whatever is internet-facing, check the list of forwarded ports against the allowlists". I've been out of the infrastructure world for too long to give a solid recommendation there, though.
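The allowlist comparison in check 1 is simple enough to sketch. Hostnames, ports, and the alerting mechanism below are hypothetical placeholders; the input is assumed to be (host, port) pairs already parsed out of masscan or similar output.

```python
# Sketch: compare scan results against a global allowlist and
# per-host allowlists; anything matching neither goes to a human.

GLOBAL_ALLOW = {80, 443}
PER_HOST_ALLOW = {
    "mail.example.com": {25, 465, 587, 993},
}

def unexpected_ports(scan_results):
    """scan_results: iterable of (host, port) pairs from a port scan."""
    alerts = []
    for host, port in scan_results:
        if port in GLOBAL_ALLOW:
            continue
        if port in PER_HOST_ALLOW.get(host, set()):
            continue
        alerts.append((host, port))  # escalate to a human
    return alerts
```

Something listening on port 81, as in the article, would fail both checks and get flagged.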
2) Every time an internet-facing resource is created or updated (e.g. a NAT or load-balancer entry from public IP to private IP is changed, a Route 53 entry is added or altered, etc.), automatically run an automated vulnerability scan using a tool that supports customizing the checks. Make sure the list of checks is curated to pre-filter any noise ("you have a robots.txt file!"). Alert a human if any of the checks come up positive.
OpenVAS, etc. should easily flag "directory listing enabled", which is almost never something you'd find intentionally set up on a server unless your organization is a super old-school Unix/Linux software developer/vendor.
Any decent commercial tool (and probably OpenVAS as well) should also have easily flagged content that disclosed email addresses, in this case.
3) Pay for a Shodan account. Set up a recurring job to check every week/month/whatever for your organization name, any public netblocks, etc. Generate a report of anything found during the current check that wasn't found during the previous check, and have a human review it. This one would take some more work, because there would need to be a mechanism for the human(s) to add filtering rules to weed out the inevitable false positives.
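The diff-and-review step of check 3 boils down to a set difference. In this sketch, `previous` and `current` are assumed to be sets of (host, port) pairs already fetched by the recurring job (e.g. from the Shodan API), and `FILTER_RULES` stands in for the human-maintained false-positive list.

```python
# Sketch: report only findings that are new since the last run and
# not covered by a human-curated filter list.

FILTER_RULES = {("static.example.com", 443)}  # hypothetical known-good entry

def new_findings(previous, current):
    """Return new (host, port) findings, minus filtered false positives."""
    return sorted((current - previous) - FILTER_RULES)
```

Everything the function returns is what lands in the human-review report; entries the reviewer deems benign get promoted into `FILTER_RULES` so they don't reappear.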
There was rather a lot of NATO coordination in the US-led invasions of both Iraq and Afghanistan. None of the military missions in these countries were in response to the Article V mutual defense clause of the NATO treaty. It's very easy to see how these operations (and therefore the NATO alliance) would be seen as aggressive to these countries.
This is false. Standard decoding algorithms like beam search can "backtrack" and are widely used in generative language models.
It is true that exhaustive search over output sequences is exponential in the length of the sequence, so heuristics (like a fixed beam width) are used to keep the runtime practical, and this limits the "backtracking" ability. But that limitation is purely for computational convenience's sake and not something inherent in the model.
One of the reasons I've intentionally decided not to become independently wealthy is that I want to have to explain to other people why I'm doing things. Part of my work is "charity-ish", and by not being able to do things on my own, I'm forced to improve my communication skills and involve other people in these charity activities. I think that ultimately improves the final outcome, even if the process is immensely more frustrating.
I am referring to technical work specifically. Where most of the time people don’t even know what they want until they see it.
Creating mock-ups and going back and forth costs time and money.
While most of the time what I do is good enough - I am getting tired of people trying to block work until unimportant details are “discussed”.
It would not be frustrating if customers were willing to pay for mock-up work and then for the actual work. But most want a working application right away, not a mockup, while also wanting to spend time discussing details that would be clear in a mock-up or in the first version of the app.
Could be so with charity, or depending on the field, I suppose. I think innovating or figuring out new tech is different: if something is easily understood and explainable, it's probably already been done, and if you have a great idea that hasn't been done, it's probably because it's really hard to explain or to sell others on it.
I suspect the numbers would be worse if you looked at households instead of individuals due to declining marriage rates (but I'm not willing to put in the effort to find numbers).
I don't understand what you're saying. To me, if the rate stays flat but represents fewer married couples, then this same rate actually means more homes are owned by people of that age.
Meaning - if guy and girl are married and own a house together, that counts as 2 people towards home owner bucket.
If they are not married, they'd need to each own a house for the same rate to hold.