Tell HN: Whole Yandex Git repository leaked
602 points by coolspot on Jan 25, 2023 | 340 comments
Someone just published 40 GB+ of leaked Yandex Git repositories. I won’t provide the magnet link here, but it is the top Google result for “yandex leak” when filtered by the last 24h.

Affected services:

aapi.tar.bz2 admins.tar.bz2 ads.tar.bz2 alice.tar.bz2 analytics.tar.bz2 antiadblock.tar.bz2 antirobot.tar.bz2 autocheck.tar.bz2 balancer.tar.bz2 billing.tar.bz2 bindings.tar.bz2 captcha.tar.bz2 cdn.tar.bz2 certs.tar.bz2 ci.tar.bz2 classifieds.tar.bz2 client_analytics.tar.bz2 client_method.tar.bz2 cloud.tar.bz2 commerce.tar.bz2 connect.tar.bz2 crm.tar.bz2 crypta.tar.bz2 customer_service.tar.bz2 datacloud.tar.bz2 delivery.tar.bz2 direct.tar.bz2 disk.tar.bz2 docs.tar.bz2 drive.tar.bz2 extsearch.tar.bz2 fuzzing.tar.bz2 gencfg.tar.bz2 groups.tar.bz2 helpdesk.tar.bz2 infra.tar.bz2 intranet.tar.bz2 investors.tar.bz2 it-office.tar.bz2 jupytercloud.tar.bz2 kernel.tar.bz2 library.tar.bz2 load.tar.bz2 mail.tar.bz2 maps.tar.bz2 maps_2.tar.bz2 maps_adv.tar.bz2 market.tar.bz2 metrika.tar.bz2 mobile-WARNING-notfull.tar.bz2 nginx.tar.bz2 noc.tar.bz2 partner.tar.bz2 passport.tar.bz2 pay.tar.bz2 payplatform.tar.bz2 paysys.tar.bz2 portal.tar.bz2 robot.tar.bz2 rt-research.tar.bz2 saas.tar.bz2 sandbox.tar.bz2 search.tar.bz2 security.tar.bz2 skynet.tar.bz2 smart_devices.tar.bz2 smarttv.tar.bz2 solomon.tar.bz2 stocks.tar.bz2 tasklet.tar.bz2 taxi.tar.bz2 tools.tar.bz2 travel.tar.bz2 wmconsole.tar.bz2 yandex_io.tar.bz2 yandex360.tar.bz2 yaphone.tar.bz2 yawe.tar.bz2 frontend.tar.bz2




If you want to know what's inside the archives without downloading them, I'm slowly working on my blog post about this breach. I will try to write a bit about the affected Yandex services for those who have never been interested in the Russian internet segment.

I also uploaded file lists from most of the archives:

https://arseniyshestakov.com/2023/01/26/yandex-services-sour...


I have seen your article. This bit caught my attention:

> All files are dated back to 24 February 2022.

If a coincidence, pretty interesting.


In case people aren't aware, that day Russia made a major escalation of the Ukrainian invasion that began back in 2014, starting with the announcement from Putin about a "special military operation" beginning in Ukraine by Russian forces.

It's unlikely that was the day of download; it's common practice to mask last-modified/last-accessed/created-at timestamps in dumps by setting them to some significant date or just the initial Unix timestamp.
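
To illustrate the timestamp-masking point above, here is a minimal Python sketch using only the standard library. It is not how this particular dump was produced (nobody here knows that); `MASK_DATE`, `mask_mtime`, `build_masked_archive`, and `member_mtimes` are hypothetical names chosen for the example.

```python
import io
import tarfile
from datetime import datetime, timezone

# Hypothetical masking date, picked only for illustration (midnight UTC).
MASK_DATE = datetime(2022, 2, 24, tzinfo=timezone.utc)


def mask_mtime(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Overwrite the modification time stored for one archive member."""
    info.mtime = int(MASK_DATE.timestamp())
    return info


def build_masked_archive(files: dict[str, bytes]) -> bytes:
    """Pack in-memory files into a tar.bz2 whose members all share one mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(mask_mtime(info), io.BytesIO(data))
    return buf.getvalue()


def member_mtimes(archive: bytes) -> set[int]:
    """Collect the distinct mtimes recorded inside a tar.bz2 archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:bz2") as tar:
        return {member.mtime for member in tar.getmembers()}
```

If every member of an archive reports the same mtime, as `member_mtimes` would show here, that suggests the dates were rewritten when the archive was packed rather than reflecting real file history.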


Not a coincidence. Yandex has its own VCS (called Arcadia). But not all services used it; some used GitHub, public and private. After the war started, they had to migrate everything to the internal VCS for obvious reasons.

So it makes sense they stopped committing to other repos somewhere around that date.

I don't have any inside knowledge now, but my guess would be that the leak is from 'on-prem' GitHub.


It's not coincidence :)


Yandex reverse image search is very good. I use it more than tineye, bing, or goog. It either gives you the exact matches if it can find them, or else it can infer what is desired and show many similar matches.


I wonder if that's a function of their technology or their database. Google's image search used to be good, but regulations have forced them to cripple it in some ways, and they might have chosen not to surface results in some contexts.


I don't seem to recall any big story about regulation at the time. It seemed that they carelessly had their internal facial recognition tools (probably also used in Picasa, etc. grouping by person) run on all public content, and the results were fed into public image search. Then people started to notice that Google knows about ALL of their photos posted on the Web, and that probably negatively affected their opinion on the prospects of sharing all their photos and personal data with corporations.

Also, such data collection abilities are generally limited to governments, so it was clear that many of them (US first and foremost) would ask for exclusion of certain individuals, and so forth, and so on, so the public tools were crippled.

Yandex image search does have some facial recognition, but it also seems bit-starved and/or mixed with text search (there's a bigger chance to match if the name and surname is present).

Also, Google is pretty Victorian about porn these days. It's almost like it has a whitelist of “acceptable” porn sites to suit the tastes of potentially angry old ladies.


It's not only regulations: Chrome's built-in image search uses Google Lens to find products to sell you instead of doing simple reverse searches. It has become useless.


Agreed. You can currently do a two-step process to get to the old useful search by clicking "Find image source" in the sidebar Lens results, but it's inconvenient enough that I barely use it any more.

In some browsers/profiles I use a "Search by Image" extension that still works properly.

When Chrome first made Lens the built-in image search, there was a way to turn it off. Does anyone know if that's still possible by some more-hidden method?


100% this. Google Lense is a product search engine, not image search.


nitpick: "lense" is not a word.


Agreed, nobody's life improved after they switched to that. There's still a "Find image source" button above the picture in Google Lens that more or less does what you want.


Totally. Even the regular search is mostly made of selling links.

And in quantity, it seems like 9/10th of the content is gone.


It's not regulation. Google has just chosen to cripple it, the same way Twitter and Facebook mostly chose to censor certain viewpoints. All of these groups think they are doing good.


That's part of the reason but it's likely also due to what user ogurechny explained. Most consumers have no idea about the true capabilities of their databases and it would scare them if they understood. Take facial recognition as an example. Also a non-crippled reverse image search would be the perfect stalking tool. Simply a lot of ways they could bring lawsuits and regulation on themselves so they tone it down just enough to not lose their search engine hegemony.


>but regulations have forced them to cripple it in some ways

the regulations don't apply to Microsoft apparently, because even goddamn Bing has been better at it than Google for years now.


I never realized that reverse image search was so bad because of regulations. Can you give me some more context about which regulations affect google reverse image search and why?


I don’t know about reverse image search but I believe the previous poster had this in mind:

https://time.com/5163852/google-view-image-search-remove/


Another story about Google image search, from when they removed the "View Image" button just to end the feud with Getty Images.

https://www.digitaltrends.com/computing/google-alters-images...


Yandex regular search is very good too, especially at surfacing results which appear to be censored by Google. The only problem is the interface language keeps flipping back to Russian.


I agree. I use it when I'm searching for "iso" torrents.


Is there a meta search engine that highlights results from Yandex that are missing from or ranked much lower on Google?


Searx used to highlight the source engine. I don't remember ever seeing yandex though.


Yandex translation and maps were also substantially better than Google in Russia when I traveled there in 2019


They are better not only in Russia. Many Google services feel outdated, sloppy and overcomplicated after using Yandex for a while.


It is not hard being better at translation than google at this point. Probably the best option right now is deepl.


I like that I can just drag & drop or paste any image into the Yandex Translate website and it translates anything. It also has better language detection than Google Translate, so you don't have to waste time looking for the language you want to translate from. So far I haven't found any alternative, other than the Bing Translate desktop app...


Google Lens translates images too.


Our cat once demolished a nice chair in a rental apartment. The chair belonged to the owner. At first we could not find where to get a replacement, but yandex reverse image search helped to my surprise!


It's also unexpectedly good at finding the original full images from cropped versions.


Not just cropped, the perspective and colors can be wrong. That is, you can take a photo of a poster on a wall from an angle, and it will give you the original image. It's impressive.


I think they also made it dumber after Bellingcat used it to identify a secret service officer. Interesting to see if we can find any trace of that in the git history.


Allegedly it's snapshot data, not the actual Git repositories.


By the size of the dump, my guess is that each tarball is a repo with the .git or .svn dir intact. But I cannot download it now.
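
One way to test that guess once a tarball is at hand is to stream through its member names and look for VCS metadata directories, without extracting anything to disk. A minimal Python sketch follows; `VCS_DIRS`, `vcs_dirs_in_archive`, and `sample_archive` are hypothetical names, and the sample archive is synthetic, not taken from the leak.

```python
import io
import tarfile

# Directory names that would indicate an intact repository, per the guess above.
VCS_DIRS = {".git", ".svn"}


def vcs_dirs_in_archive(fileobj) -> set[str]:
    """Return which VCS metadata directory names appear in member paths."""
    found: set[str] = set()
    # "r|bz2" reads the archive as a forward-only stream; nothing is extracted.
    with tarfile.open(fileobj=fileobj, mode="r|bz2") as tar:
        for member in tar:
            found.update(VCS_DIRS.intersection(member.name.split("/")))
    return found


def sample_archive(paths: list[str]) -> io.BytesIO:
    """Build a tiny in-memory tar.bz2 of empty files, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for path in paths:
            tar.addfile(tarfile.TarInfo(name=path))
    buf.seek(0)
    return buf
```

For a real file you would pass `open("some.tar.bz2", "rb")` instead of the sample. An empty result would be consistent with the "snapshot data" claim, while `{".git"}` would support the intact-repo guess.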


Their Street View product is also very good, however their coverage is mostly for ex Soviet Union countries


heh, I remember looking up some satellite maps of French prisons on Google maps. They were blurred!!!!

https://www.google.ca/maps/place/Centre+P%C3%A9nitentiaire+d...

A quick trip over to Yandex, and there they were in their full glory:

https://yandex.com/maps/10502/paris/?l=sat&ll=2.340173%2C48....


It's funny when countries fuzz up their google maps images of military bases, or oil companies, or whatever. The US military clearly thinks doing that is a joke: you can zoom in on area 51 and look at parked aircraft.


The US just knows where all the satellites are well enough that they hide the good stuff inside when something's scheduled to fly over.


Good on Google for telling us what places to pay attention to ;)


Probably because of legal compliance rather than technical reasons.


Had no idea this existed. Fascinating


My understanding is that it has reverse facial recognition as well. Very cool. Also, it’s an interesting foray into the ethics of AI in that this could easily be used for nefarious reasons. (You can snap a photo of anyone who’s been indexed and find them online very easily.)


Agree. Here are two comparisons of some searches - [https://i.imgur.com/RmMWSG4.png] and [https://i.imgur.com/sIOhx6y.png] In 2022 even Bing beats Google..


Hmm, didn't know they had one, but I'm let down by Google's 90% of the time so I'll try it next time.


I actually doubt that a lot of that magic will be shown, as models (because of MLOps) won't be in the code.


It also sucks for privacy.


It's good for privacy if you don't live in Russia... I don't think that they share the data with the US government?


Eh, it shares data with the Russian government.

If I have to choose between the two, I'll go with the US one.


why?


Because the Russian government is much more dangerous to me than the US one.

Russia sent agents to blow up an ammunition depot ~15 kilometers from where I lived (google "Vrbetice 2014" if interested), killing 2 people. Russia regularly poisons and assassinates people. I mean, I'm no big fish to be interesting for Russia, but still the comparison comes out clear.


Russia is currently trying to invade and conquer even larger parts of Ukraine and a significant part of the world is helping Ukraine defend itself at the moment.

I prefer handing my data to privacy-breaching companies if the alternative would be handing data over to a state my government is currently holding a proxy war against.


The US has invaded Syria right now and is stealing Syrian oil.


And if I were Syrian I certainly wouldn't want to use American services with the Russian government supporting the current regime.


Lmao, bc Russians will murder me for being gay while the authorities look the other way, amongst all of the other atrocities that their state is perpetuating right now.


Remember this: yandex.com is headquartered in Europe; yandex.ru is headquartered in Russia. Search carefully.


This is not the first leak from Yandex. So they might "share" data with everyone. I used their Yandex food service. My personal info was eventually leaked and fraudsters started calling me. Not a huge issue but not a great situation either.


I tend to think that it's not the govs tracking us that we should care about but the big tech. Yandex is one of them.


There are multiple levels of tracking... Governments being the top levels, then maybe global communication companies along with big tech titans... etc... AkA a complicated forest and/or tree of data?


That's complicated indeed. When I start thinking about who can track me (wi-fi, bluetooth and RFID scanners and surveillance cameras everywhere, my telecom provider, banks, public transportation, gmail etc., browser fingerprinting…) I just don't see any practical way to ultimately hide. Add the fundamental lack of trust to any device using ICs (evidence-based, field verifiable) and it becomes just insane.

In practice I just use a dumbphone and turn on my smartphone only to get a taxi (Yandex) or maybe see if my bus is coming (again Yandex). But my wife has her smartphone always on spying on us, not to mention all my colleagues. Also on my laptop I do use Yandex market, maps and read articles on Zen…

I can imagine that deep awareness of digital security issues may make one's life better in some situations but there's always a risk of becoming too much of an activist here. I think I'm quite geeky already.


> That's complicated indeed. When I start thinking about who can track me (wi-fi, bluetooth and RFID scanners and surveillance cameras everywhere, my telecom provider, banks, public transportation, gmail etc., browser fingerprinting…) I just don't see any practical way to ultimately hide. Add the fundamental lack of trust to any device using ICs (evidence-based, field verifiable) and it becomes just insane.

It is unfortunate that there appears to be only one way to reclaim our privacy, and that is through the passing of new laws and regulations. However, due to the way that laws are passed in the United States, I am not very optimistic. For instance, sometimes unrelated laws are attached to necessary budget bills in order for them to pass, or a three hundred-page bill of legal jargon is given a misleading title such as the "PATRIOT" Act (The unpatriotic bill). Furthermore, both political parties do not seem particularly supportive of privacy, if anything they appear to be opposed to it. We need to somehow make this an important political issue in 2024.


Unfortunately it's impossible to fully elaborate my position here in the format of HN comments but I don't think the way you're talking about is really possible. The system is broken so badly that it cannot be fixed from inside by adding more of its rules. It has to be changed entirely. Figuratively speaking it's sort of like Goedel's theorem.

Have you had an opportunity to discuss their views on privacy and the currently established political culture in the US with those in their 70s and 80s?


And then we have China, where the big companies are required to share the data with the government if they request it.


> And then we have China, where the big companies are required to share the data with the government if they request it.

That's one thing that is probably the same in most countries... at least we know it is also true for USA and Russia.


Then the governments request your info from big tech (and get it). Same thing in the end.


This is true. But 'the government' (I'd rather say agencies and services) mostly doesn't care about us unless we do something 'special' and you basically can't hide from them. The big tech is always interested.



It wasn't google-able per the OP's instructions, but this did the trick. Thanks digianarchist!

Archive links:

https://archive.today/h5XJs

https://web.archive.org/web/20230125224316/https://breached....


It can be found on Yandex too!

https://i.imgur.com/rxYINhF.png


How is this site allowed to be live?

The Yandex source is cool. But there are a lot of leaks with people's private data.

The US authorities move mountains for TornadoCash, Z-Library, etc... why leave this one?


"Private people" don't have a powerful lobby group.


"Since this is leak only contain contents of git repositories there is no personal data." https://arseniyshestakov.com/2023/01/26/yandex-services-sour...


This is basically a clone of RaidForums, which was taken down.

https://raidforums.com/

It's a cat-and-mouse game that will likely never end.


>The US authorities move mountains for TornadoCash, Z-Library, etc... why leave this one?

Nobody is moving mountains for those, what makes you think that?


Anyone else's download stuck? I haven't used torrents in forever, not sure if this is normal or if there's a workaround.


Most likely your ISP throttles torrent bandwidth. Use a VPN.


Mine is stuck and my ISP has never throttled anything.


  % bzgrep "BEGIN PRIVATE KEY" *.bz2
  disk.tar.bz2:Binary file (standard input) matches
  drive.tar.bz2:Binary file (standard input) matches
  extsearch.tar.bz2:Binary file (standard input) matches
  ...


People check in fake private keys to git repos all the time for testing. My own tests have private keys too. They're just sample, unused, publicly advertised private keys I found online. They're useful to make sure your code is working end to end with some private key.

EDIT: For example, here: https://ospkibook.sourceforge.net/docs/OSPKI-2.4.7/OSPKI-htm...

or here: https://docs.vmware.com/en/VMware-NSX-Data-Center-for-vSpher...

or: https://www.ietf.org/archive/id/draft-bre-openpgp-samples-01...


It could also be a script that imports a private key and searches for the string BEGIN PRIVATE KEY.

Likewise if someone searched HN for this string he'd find your comment (:


Checked in private keys are fine if they're just used in tests, local development, etc.


Technically fine yes but from a habits and practice standpoint it's safest to stick to a "not ever" rule and work around the limitations.


Checking in fake private keys is fine for testing. Why is it bad, out of principle, just in case you check in bad private key? I think that's a bad argument because there are a lot of benefits to being able to run end-to-end tests with some key.


Care to explain? Keeping private keys inside the repo sounds fine to me as long as these keys are only used for local development, they are rotated regularly, and they are only valid for localhost (in the case of TLS certs).


Not GP: If you make it normal to check in credentials and keys, then the risk of accidentally checking in prod secrets increases. It's basically making it comfortable for devs to deal with keys in repos and I think that's inherently dangerous.


You should be using automated checks to keep credentials out of your repo, not relying on individual developers. And those checks can have explicit exceptions for known safe/public/test keys, just like you might explicitly allow testing or fake credit card numbers.
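
A minimal sketch of such a check in Python, assuming keys are stored as PEM blocks and that known test keys are allowlisted by hash. The names `PEM_BLOCK`, `fingerprint`, and `find_unapproved_keys` are hypothetical; this is not any specific scanner's API, and real tools cover far more credential formats.

```python
import hashlib
import re

# Matches whole PEM private key blocks, e.g. "BEGIN PRIVATE KEY" or
# "BEGIN RSA PRIVATE KEY", up to the first matching END line.
PEM_BLOCK = re.compile(
    r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"
    r".*?"
    r"-----END (?:[A-Z]+ )*PRIVATE KEY-----",
    re.DOTALL,
)


def fingerprint(pem_block: str) -> str:
    """Hash a key block with whitespace removed, so reformatting does not matter."""
    normalized = "".join(pem_block.split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def find_unapproved_keys(text: str, allowed: set[str]) -> list[str]:
    """Return fingerprints of key blocks in `text` that are not allowlisted."""
    fingerprints = (fingerprint(block) for block in PEM_BLOCK.findall(text))
    return [fp for fp in fingerprints if fp not in allowed]
```

A pre-commit hook or CI job could run `find_unapproved_keys` over staged files and fail when the list is non-empty. Because it requires a complete BEGIN/END block, a script that merely mentions the marker string (like the `bzgrep` command above) is not flagged, and publicly known test keys pass once their fingerprints are added to `allowed`.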


yolo


Nothing surprising. The development culture was shit back there.

Though I would expect these keys to be just some stub config values which allowed engineers to quickly run the shit locally.


Notably Yandex.Search, Yandex.Maps, Yandex.Disk (like Google Drive), Yandex.Mail, Yandex.Alice (voice assistant), Yandex.Taxi (like Uber), Yandex.Delivery (like Postmates), Yandex.Drive (like ZipCar), Yandex.Robot (idk, prob advanced AGI from the future), Yandex.Pay (like Google Pay), Yandex.Docs (like Google Docs/Sheets) all source codes leaked.


There you go, really working and scalable solution for running Mail, Storage, Voice assistant and Documents on your infrastructure and not depend on 3rd parties. End of Joke.


From quick googling(yandexing?) Yandex.Robot appears to be a self driving briefcase.


Now let's ping r/selfhosted to get this working locally and we are all set.


Yandex.Robot is probably a food delivery robot.


Probably it's just a search bot like Googlebot.


Yandex.Robot is a delivery self-driving robot. The crawler bot is a part of Yandex.Search.


I was super impressed when I interacted with Yandex ML engineering people like a decade ago on a conf call - they were trying to sell us their services. Very smart and very clearly outward-looking. Haven't followed the company in detail since. Wonder what happened to them. [removed inaccurate details]


> One wonders if this is revenge from the regime.

It's not revenge by the regime. A decade ago it was a different company; now they are all completely under the FSB's thumb, it's basically a Lubyanka branch. It's serfdom all over again, so it's not that strange it got leaked by some unhappy employee.


Reading up on the current situation with the company - yeah, I think you're right. :/


When you're judging them, think about the space of possibilities as well. Is any example of a Russian business of such volume not tied to the government?


I mean they were pretty much forced to sell business to the government recently. Coincidence? Maybe…


what business and to what government? are you in your dreams?


I guess it's about the recent exchange of their news service to VK. It was not literally sold to the government.

https://techcrunch.com/2022/09/12/yandex-news-zen-vk-sale-co...


It's not only the news service. Few services are left under Yandex INTL. Which services? You'd have to guess... they haven't published a proper disclosure, only a couple of press releases, which is alarming for a company that is still partly publicly traded.

I remember some sources that they've sold ALL their media services, and the INTL is only left with services like taxi and delivery.


A lot of really great technology and a lot of really great people.

There are shitheads everywhere.


Is this well known? Seems likely the leak was done by an anti-Putin employee or ex employee in revenge/protest against the war if yes.


I saw the forum post in this thread claiming leak happened back in 2022. This aligns well with the massive PII leak they had before so I would say there is a chance.


Yandex was probably one of the best and the most innovative Russian companies, and attracted the best workforce. But it cooperated with Putin's regime long before the war. It censored opposition narratives in its search results and promoted pro-Putin content [1]. It continues to promote Russian propaganda even today (try any Ukraine-related search query in Russian). Now it's under the oversight of Putin's man, Aleksei Kudrin [2].

Sadly, Yandex is not a neutral company and is just another weapon in Putin's hands.

[1]: https://misinforeview.hks.harvard.edu/article/a-story-of-non...

[2]: https://t.me/AlekseiKudrin/48


Don't assume that just because people are highly educated and intelligent, they will be on the "good" side.


How do you think a company that has all its business in Russia should act?


What are the short-term and long-term implications of this?

I assume a drastically increased attack surface and potentially a boon for open-source development? Anything else?


Yandex might be the most popular search engine in Russia, but also they're by far the most popular logistics company for private people here. They do deliveries (food, groceries, other goods), taxis, moving vans and so on - all strongly Yandex branded. If you walk around the streets of Russia's big cities, their brand is absolutely omnipresent (and also in some former Soviet countries, like Armenia).

Eating into their business would require much more than source code, but of course an analysis of the code could lead to finding more security issues.

> potentially a boon for open-source development?

That'd be an absolute copyright/licensing nightmare, just because the code was stolen and published doesn't mean it is now "open-source".


In Russia the copyright rules are more relaxed though. As long as you stay small nobody will enforce it


Is it different in other countries? Who checks that local pizzeria uses genuine Windows?


Windows? No one unless disgruntled ex-employees blow a whistle. Other stuff like pirated pay-tv streams though? These absolutely have agents on the streets scouting out bars and pubs to check if they're paying customers and if they are actually using an (expensive) commercial license or if they are mis-using a residential one.


Microsoft.


No, they are not. Laws are about the same.


It’s all about enforcement though. In North America you can’t walk into walmart and buy a pirated AAA game title with keygen and mods preinstalled on the disk


Part of me is seeing someone with GPT prompt asking for a 'rewrite'. We truly live in interesting times.

For the record, I don't disagree with you on the licensing/copyright front.


I think a “clean room” reimplementation should be doable here. I. e. one person writes the specs based on the code, another one writes the code back from the specs.


In fact the damage would be mostly reputational.

I expect the code to be mostly worthless. There is just too much of it, it's poorly documented and, often, just badly designed and badly written.

And the actual important data (index shards, voice models, all that crap) is not in these dumps.


Probably Yandex will start watching their infrastructure very closely, and go through all known attack vectors that haven't been prioritized before to fix them ASAP.

Won't be a boon for OSS, any author would be idiotic to read stolen source code and then decide to create a OSS library/project based on what they learn from it.


> any author would be idiotic to read stolen source code and then decide to create a OSS library/project based on what they learn from it.

Ah, but will it make its way into Copilot? That could be interesting.


Oh, that code? Copilot generated it. I'm innocent.

The Copilot Defence


I must be writing very obvious glue code because I’ve only ever had copilot suggest very obvious glue code.


Very obvious glue code is 95% of what I write. It's the remaining 5% that justify my salary.


Making mark on machine: $1 Knowing where to make mark: $9999


> any author would be idiotic to read stolen source code and then decide to create a OSS library/project based on what they learn from it.

What? Why? Isn't this what software developers do — they read a lot of code; they find ideas they like; they mix them together with their own ideas while building something. Isn't this how learning works in general?


> What? Why? Isn't this what software developers do — they read a lot of code; they find ideas they like; they mix them together with their own ideas while building something.

Not normally. https://en.wikipedia.org/wiki/Not_invented_here#In_computing

More to the point: an organisation will usually use cleanroom techniques to clone something like this, if they choose to do so. Anyone who's been in contact with the original source code is "polluted": https://wiki.winehq.org/Developer_FAQ#Copyright_Issues


It's worth noting that the "clean-room" doctrine is not legally required and is even discouraged, in favor of more direct source code reading and idea extraction (see Sega v. Accolade and Sony v. Connectix; in the latter, the court commented on clean-room being inefficient and the kind of wasted effort that fair use was designed to prevent).

The whole point of the idea/expression dichotomy is that copyright doesn't protect ideas. To that end, extracting the uncopyrightable elements out of copyrightable code is more than legal and court-endorsed. Wine's approach just makes me sad, lots of wasted effort on black-box testing where a disassembly of the real code would be legally in the clear and allow it to advance at much better speeds.


You aren't allowed to use cleanroom techniques to clone something that's been stolen. You're still guilty of copyright violation for even downloading the code in the first place, and who knows, maybe also laws against receipt of stolen goods. Cleanroom reverse engineering is usually applied to a product that you've actually bought and paid for. You'd have to be really crazy to try and use this code for anything.


Yeah, reverse engineer the end product. You can't have one guy in the office looking at leaked code writing up detailed specs of how the code works and then giving that to someone to write.


God copyright/IP is so stupid, we should get rid of it. People should be able to just download and use this code directly



It is. There is this hysterical idea that reading code somehow makes you "tainted" when it comes to ever "safely" implementing something similar.


> Won't be a boon for OSS, any author would be idiotic to read stolen source code and then decide to create a OSS library/project based on what they learn from it.

This is naive. This generation seems very sensitive to the prospect of computer crime.

The stolen source code will almost certainly be read, and if deemed novel enough will be turned into open source projects. It may be tough to figure out those projects are derivatives of stolen code, but most likely they will be passed around in black market repos.

I looked through some of my telegram channels to see if anything has been posted yet. Lo and behold, the stolen files are in fact available… from a server in Ukraine.


>This generation seems very sensitive to the prospect of computer crime.

You're right. I remember a time, maybe 20 years ago, when stuff like this would be generally appreciated by any community of hackers.


I think the problem is people are entering a matured industry, all they know is the professional buttoned up culture we have today. People have no further curiosity beyond working and making money in tech.

They do not have memories of a time when people tinkered around doing all kinds of crazy and possibly illegal stuff, just for fun, just to see if they could. Sad really.


Tinkering is fine. Even leaking and study is fine.

But re-licensing someone else's code is at the very least impolite. People avoid using BSD-licensed code in GPL code base, or vice versa, just to not violate an open-source license of some fellow hacker who won't sue (but would yawp on Twitter, or something).

Also, cutting proprietary stuff out of the thick scaffolding of proprietary dependencies is often hard. Removing proprietary cruft / tech debt may be even harder. A rewrite from scratch (not clean-room though) may be just easier.

This is to say nothing about any serious business unwilling to depend on code of unclear provenance, with possible legal complications attached; there goes adoption.


> People avoid using BSD-licensed code in GPL code base

Huh? I don't believe they do that. It's completely fine to use BSD in GPL code, but not the other way around.


It is fine to do either but the result is always a GPL license.


I actually laughed when I read this. It is a terrific, concise explanation of the nature of both licenses.


The code keeps the original license, please don't go sticking "GPL" on the BSD bits when redistributing the source.

Binaries, yes -- either more or less restricted depending on which side of the ideological fence you're sitting on.


> People have no further curiosity beyond working and making money in tech.

For some, sure.

There are also many who still explore, but qualitatively the overall culture feels very different.

Unfortunately, there are employers who seem to want their employees to tinker, but own virtually everything they do, all while still paying lip service to supporting work-life balance, including parenting. So we get mixed messages and weird hiring priorities.


> There are also many who still explore, but qualitatively the overall culture feels very different.

I thought the same, then one day I stumbled onto a discord populated by teens and college students who were doing the same sort of shenanigans my friends and I were up to back in the day.

The S/N ratio is worse, but there are still people out there having fun!


Fair enough. Still, after many decades in software, the current ethos feels so different as to be unrecognizable from where I started. I learned to program on an Apple ][+. To start, I was not business driven; I primarily explored software and hardware, without a particular business end in mind.

Later, after studying engineering, I moved more into caring about problems and products. Over time, I found more and more companies wanted the benefits of having skilled software developers, but they seemed to primarily want the businessified version -- solve the problem at hand while punting obvious existential problems down the road.

Sure, many Silicon Valley startups say they want "passionate" programmers, but they didn't usually embrace the tradeoffs that go along with it ... time and resources to explore, supporting risk taking, and fostering a culture where some experimentation is its own end (not merely a means).

Worse, these same companies that claimed to be business focused didn't even engineer well.

So often I saw a double failure: a lack of both the long-term exploration and tactical execution. Somehow both got lost.

There are exceptions of organizations that have healthy, adaptive cultures that internalize continual assessment and improvement, ranging across team interactions, software quality, user experience, and organizational processes.

Perhaps I was too naive and didn't fully appreciate the difficulty here; my progression in engineering has generally coincided with me lowering my expectations. I mean this as a trend line only; I have seen some impressive exceptions, but they tend to be unusual and fairly narrow in scope.

For example, the best product vision I've ever seen was coupled with a seemingly distracted CEO.

This is just one thread of my experience, I'm sure others have seen many other things. I for one would really enjoy reading reflections on working in the software industry.

For myself and perhaps other "passionate" programmers, learning to dial down "caring" and attempting to fit in with imperfect cultures is very hard. It would be one thing if the people in control seemed to understand the situation and showed empirically good results.


How did you stumble across such a group? I have been looking for a community like you describe since 2018, when I left the group I was part of at the time.


The punishments are now somewhat known. I used to tinker with things that were questionable, but after seeing what happened to Swartz and others I stopped. It’s not worth it.


Swartz was going to get a custodial sentence of less than 6 months, quite likely just probation. A slap on the wrist.

What exactly did you expect the punishments to be like? A fine?


It is widely suspected that was the goal.


Not really. If you value your own work, have things you have chosen not to give away and therefore wouldn’t want someone else to steal them, then it makes logical sense to consider the value of other people’s work and not feel comfortable about using a stolen version.

Not everyone has to share a political point of view, so while some people are anarchists and are happy to just have a giant free-for all others believe in the concept of ownership. That isn’t a lack of curiosity or a professional buttoned-up money-chasing approach.


I think OP is lamenting the fact that the hacker culture shifted away from what you describe as "anarchism" as the mainstream position. It really was a lot more common among people who self-identified as hackers 20 years ago than it is today.

More broadly speaking, I'd say that the cultural distinction between "hacker" and "corporate" was much wider then, and it manifested in this manner among others.


I've been programming and part of "hacker" culture for over 40 years now so I get it.

20 years ago you would get people lamenting that 20 years before that everything was all hippy sunshine and rainbows and homebrew computer club, phreaking and shareware etc, and that wasn't true then just like it isn't now.

I don't think anarchism has ever been the mainstream position at least never in my experience. It has always been present but always been fringe.


Hang out on different forums


Where?


Maybe, but I remember when the Windows 2000 source leaked, and the ReactOS developers started forcing all contributors to sign a statement saying that they hadn’t read it.


Yup it'd end up in random places just like fast inverse square root.

Origin will be forgotten and that will be that.

These days everyone whacks their code on GitHub. I wonder if it'd be possible for GitHub to review repos for likely derivatives of leaked code?
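
For what it's worth, here is a minimal sketch of how such a review could work in principle: split each file into overlapping token k-grams ("shingles") and compare the sets with Jaccard similarity. Nothing here is GitHub's actual method; the `shingles` and `jaccard` helpers, the tokenizer regex, and `k = 5` are all assumptions chosen for illustration.

```python
import re


def shingles(source: str, k: int = 5) -> set:
    """Return the set of overlapping k-token windows in a source string.

    The tokenizer is deliberately crude (identifiers, digit runs, and single
    punctuation characters); whitespace is ignored, so reformatting a file
    does not change its shingles.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical snippets: the same statement with different line breaks.
leaked = "float y = x * 0.5f; i = magic - (i >> 1);"
candidate = "float y = x * 0.5f;\n    i = magic - (i >> 1);"
score = jaccard(shingles(leaked), shingles(candidate))
```

A score near 1 only says the token sequences overlap heavily; renamed identifiers or reordered functions would lower it, so this is a screening heuristic at best, not proof of derivation.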


If GitHub did that then there's a chance it would flag Copilot-written code, revealing their public narrative as a lie, so I think they'd quickly kill any such proposals.


Black market repos? Who hosts those?


> Won't be a boon for OSS, any author would be idiotic to read stolen source code and then decide to create a OSS library/project based on what they learn from it.

With the current geopolitical situation going on, is this really true? (From a western developer's perspective)


Even if Russia were at war with the world, it would still be unethical to loot its citizens' property.


Intellectual property is a social construct that society decided was important enough. Often times there is no direct correlation between the "effort" and the "compensation" that IP enforcement enables.


Property rights are a social construct regardless of whether the property in question is intellectual. So what?


You are correct. Any bit of property that you aren't personally in physical possession of is a social construct.

Social consensus is the only reason that you can leave your car at the curb, and come back to it a week later, and still expect it to be yours. Social consensus is the only reason that you can own land that you don't personally use. Social consensus is the only thing that prevents people working at your widget factory from deciding on Tuesday that it's actually their widget factory, and that they would be better off by cutting out the middleman.

There are very rare, very extreme situations in which this kind of social consensus is very prominently broken.


Yandex is complicit in the war and willingly helping Russian government with censorship. Its existence is unethical.


It should be treated the same way they've treated royalties.


the US and russia are both parties to the berne convention, the law still applies even if a bunch of misc russian people and orgs had their assets seized


It's not about whether it applies, but whether there is any will to enforce it.

Would the US government cooperate in the indictment and prosecution of a US citizen, on behalf of a de-facto enemy nation, for a company that has ties and is allegedly controlled by the government of said nation?


it's not criminal law, yandex can hire a lawyer in the US


But could they find a judge/jury sympathetic to any Russian concerns?


It's technically a dutch company. They had to do some business manoeuvres to continue operations.

https://ir.yandex/


courts decide much more based on law than sympathies, seriously


Usually yes, but not always and I think we all know this.


if you focus on the outliers, you're missing the picture


You saying that US judges lack the integrity to issue impartial judgement?


Yes, I’m partially implying that and the records speak for themselves.


I would love to see the records that you are referring to.

Not isolated incidents, rather something suggestive that one could confidently bring a sympathy-based argument in deference of law and expect a high probability of succeeding.


I wasn’t saying anything about sympathy. If you think US courts are fair, I’d suggest you do your own research.


What records are you talking about?


The history of Justice in the US.


The "history of justice in the US" is an extremely broad generalization. Justice has had a lot of history. If all of it pointed to judges being unable to be impartial then the legal system would likely have failed long ago. Which specific occurrences are you referring to? Just one is fine, it only takes a single positive to disprove a negative.


yandex is currently based in the Netherlands


They are still the most popular search engine in Russia, they have the best brand, etc. Intangible stuff like that is hard to copy. Running a search engine on the scale of Yandex is very expensive so I don't think that they are going to be replaced by some startup that copies their code and adds a few features. Probably a bunch of bugs will be found but that's not the end of the world. Apparently they handle images pretty well so parts of that may be (illegally) copied in some places.


Knowing how stuff like anti-spam algorithms and ranking algorithms work in order to abuse them is probably the much higher value here.


The code is worth a lot less than the data for the models, which is much better guarded and is not in this leak.


Probably true. I don’t know much about exploiting algorithms like this, but doesn’t even knowing how the algorithms work give a somewhat substantial advantage in stuff like SEO?


From my own understanding - not really, most of such stuff is done with ML models. Code is worth almost nothing there.


I use Yandex when DDG and bing censor real results.


I can still see many companies trying. If the opportunity is there to make money off of this leak then someone will attempt it.


Using stolen code in an open-source project seems like a bad idea.


Someone could still get the uncopyrightable ideas out of it though.


Sort of. There may also be patents involved. It'll be especially hard to defend against that when your inspiration came from the patented code itself.


Assuming code can even be patented, which generally it cannot.


Then we can train FancyCodeGeneratorModel on it to bypass this unfortunate licensing problem.


Damn, that's not a bad idea, "convert this code so that it is indistinguishable from the source"


That shouldn’t be too hard...


Does the USA care to enforce Russian copyright law anymore? For the near to medium future it seems unlikely to be a priority.

Still not a solid foundation to start with, though.

Solution: Take kernel of the idea and implement it yourself.


All of those, assuming there's nothing malicious in the code. Google see if any of their code had been stolen, not like they can do much right now.


magnet:?xt=urn:btih:7e0ac90b489baee8a823381792ec67d465488fef&dn=yandexarc&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2F9.rarbg.to%3A2920&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969&tr=udp%3A%2F%2Fbt1.archive.org%3A6969%2Fannounce&tr=udp%3A%2F%2Fbt2.archive.org%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce


Note that Yandex currently seems to be made of “Yandex teams relocated out of the country” (say, 1/5th of the company, I don't have exact numbers), “Yandex teams inside the country”, and “Yandex brand services sold to other company”, so the internal access situation is complex.


I imagine a combination of COVID and the Ukraine War has contributed to a change in access security policies.


Seems like the main repository ("arcadia") isn't there.


Looked around and it seems like arcadia is a sort of VCS on top of git/a build system?

https://github.com/yandex/CMICOT/blob/master/.arcadia.root


When I was there, arcadia was the primary SVN monorepo where everything was stored (and built by a custom flavour of cmake). From what I heard it's still SVN and git was completely eradicated in the company.


This is pretty interesting to me. Usually companies will gradually switch over to git. This is the first time I've heard of a company using git and ditching it in favor of older technology.


Yandex used (and maybe still uses) an SVN monorepo named Arcadia for most of the source code. As it became too big, they developed an in-house VCS called Arc and were switching to it; there is more context in a presentation about Arc (2020), transcript/recording in Russian[1]

I could assume that this is the leak of Arcadia, and each archive roughly contains one top-level folder.

[1]: https://habr.com/en/company/yandex/blog/482926/


git is inefficient when you need to handle a huge monorepo where just one checkout weighs 66 GiB and the history is measured in tens of terabytes.

The other problem not addressed by git was the support of per-directory permissions.


Google also uses something close to SVN and explicitly avoids git. I don't think there's any sense of progression between the two. I think it comes down to which one fits your processes better.


While reasonable people can discuss how "close to SVN" Perforce is, my understanding is that Google was using Perforce for the longest time, or a tool modeled very closely upon it


In case of centralized development git creates too much friction, svn works better there.


Has anyone found the FSB back-door portal?


Do they need a back door? Won't they just pick up the phone and ask for whatever they want? It isn't like anyone in Ruzzia is going to say 'no' when the FSB calls.


Everyone can say no. Once.


Yandex cooperates with Russian government, no need for a back door.


I imagine it's just a good old admin panel with a normal login.


c'mon, don't say that as if you haven't found it


Anyone have insight what “skynet.tar.bz2” is?!


Skynet is a command execution and metric collection layer used in basic cluster ops. Nothing cool to be honest. Essentially it allows you to run a command on all the hosts matching a predicate.

This was the easiest way to disrupt a whole datacenter for those who had enough privileges, I did it at least once. Also several times unnoticed bugs in my code shut down literally everything for noticeable periods of time. I've been developing the search orchestration system.


There's no confirmation about the number of machines the script is going to affect?


Yeah, it was extremely easy to fuck up. In some cases though you might have wanted to run something on thousands of hosts to debug a tricky issue. Bad practices, I know.


Probably, but irrelevant. What you really need is a confirmation if the sent command will lead to bugs/downtime or not. If you think the command is working 100%, why would you say no to run it on 1000s of machines all at once?


It's eventual consistency all the way, so it actually doesn't really know how many machines are there


That does not necessarily mean it could not provide some useful info about that. E.g. it could run a ping on all eligible machines and show total count of responses in real time for the currently typed filter.
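
To make the idea in this subthread concrete, here is a minimal sketch of a "match by predicate, then confirm the count" guard. It is not Skynet's real interface: `Host`, `select_hosts`, `run_on_matching`, the `max_hosts` threshold, and the static inventory are all hypothetical, and a real system with an eventually consistent host list could only show an approximate count.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Host:
    name: str
    datacenter: str
    role: str


def select_hosts(hosts: Iterable[Host],
                 predicate: Callable[[Host], bool]) -> List[Host]:
    """Return the hosts the predicate matches, without running anything."""
    return [h for h in hosts if predicate(h)]


def run_on_matching(hosts: Iterable[Host],
                    predicate: Callable[[Host], bool],
                    command: str,
                    max_hosts: int = 10,
                    confirmed: bool = False) -> List[Tuple[str, str]]:
    """Build a (host name, command) plan for every matching host.

    Refuses to continue when more than max_hosts match unless the caller
    passes confirmed=True. The plan is returned rather than executed, so
    this sketch has no side effects.
    """
    targets = select_hosts(hosts, predicate)
    if len(targets) > max_hosts and not confirmed:
        raise RuntimeError(
            f"{len(targets)} hosts match; pass confirmed=True to proceed"
        )
    return [(h.name, command) for h in targets]


# Hypothetical inventory: 20 search hosts alternating between two datacenters.
inventory = [
    Host(f"search-{i:03d}", "dc1" if i % 2 else "dc2", "search")
    for i in range(20)
]
```

Calling `run_on_matching(inventory, lambda h: True, "uptime")` raises because all 20 hosts match, while narrowing the predicate to one datacenter stays within the default limit. The guard does not tell you whether the command itself is safe, which is the objection raised above; it only makes the blast radius visible before anything runs.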


Assuming Google Translate is accurate, it's a server management solution.


Interesting that most of the code comments are in English.


When I was learning C++ in late 90s in the university the DOS environment I was using didn't have the i18n configured properly. My teacher noticed that I was trying to use Russian transliteration for comments and she suggested to 'just use English'. That solution didn't seem obvious to me at all.

Also, some other day she suggested that another student install Linux on their home PC. I followed that advice too and learned a good bunch of my English by reading man pages.


It's inconvenient to switch the keyboard layout each time you write a comment.


New AZERTY also has Greek in addition to Latin, but sadly not Cyrillic (might be a nice project ?)

https://norme-azerty.fr/en/


There are many "Cyrillic" variants.


It is not really - if standard libs, APIs and frameworks are in English, it is either this or some weird 50%/50% mix, which is often ambiguous. It is a rather standard practice in all non-English-speaking countries.


In France we're more 100% or 0%. You have people who really really want to use English and end up mistranslating most of the business logic in confusing spaghetti of almost-English and false friends, and you have people who tell you that to avoid being outsourced to India, we should comment everything in French. I worked on a spaghetti of bad English long enough that when I started having some power, I made us switch to use French everywhere, even variable names: this sounds awful at first, but then, a miracle door opens and you can sit down with your business expert and read through the logic with him and understand what was made and how it fit what he expected (since, ofc, he speaks no English whatsoever), even hours after having coded it and you already started to forget !

If I had to work again in a French company, I'd only use English if we were all perfectly fluent, otherwise it's just a mess, you have people, I don't know how they even learned programming, who know little to no English at all whatsoever writing variable names and comments in something so confusing and that no French reverse-translation can ever salvage.


Worked 20+ years in German companies, all code and all comments were English, long before remote work (as it is today, all of my coachees have now mixed teams).


I'm not sure if any large company in Russia doesn't write code comments in English.


Programming languages are based on English, idioms are given English names - mixing in other languages is the worst of both worlds.


Another reason is encoding confusion. If the text is not ASCII, then you somehow have to guess its encoding. For compatibility reasons a lot of tools and editors may default to an old codepage instead of UTF-8 (they may respect UTF-8 if you have a BOM); add to this legacy code already written in who knows what encoding. Is it a Windows codepage? DOS codepage? Linux? ISO-8859? If you guess the encoding wrong, you can corrupt text.
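
A small sketch of that guessing game, assuming the legacy candidates are the Windows (`cp1251`) and DOS (`cp866`) Cyrillic codepages. `decode_guess` is a hypothetical helper and the order "BOM, then strict UTF-8, then one legacy codepage" is only a common heuristic, not what any particular editor does.

```python
import codecs

RUSSIAN = "Привет"


def decode_guess(raw: bytes, legacy: str = "cp1251") -> str:
    """Guess the encoding of raw bytes.

    1. A UTF-8 BOM settles the question.
    2. Otherwise try strict UTF-8, which rejects most legacy-encoded text.
    3. Otherwise fall back to a single assumed legacy codepage.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


# The same word stored three ways.
with_bom = codecs.BOM_UTF8 + RUSSIAN.encode("utf-8")
plain_utf8 = RUSSIAN.encode("utf-8")
windows_bytes = RUSSIAN.encode("cp1251")

# A wrong guess does not fail, it silently produces different characters.
misread = windows_bytes.decode("cp866")
```

The last line is the dangerous case: the bytes decode without error under the wrong codepage, so an editor that then saves the file writes the garbled text back and the original comment is lost.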


So you can confirm they are legit? Anything look exciting?


While I personally haven't worked at Yandex I can certainly say they are legit. I have a lot of friends who worked there and a lot of intranet URLs in documentation are legit.


Lingua franca with all their contractors most likely.


At first I wanted to give a serious reply, but then I saw your (very aptly chosen, given the context) username and found myself wondering if HN rules allow parody accounts...

However, I'll give one for those who've missed the irony: It's considered gauche to have ubiquitous non-English comments.

Pretty much the only kosher use case is when you have to warn people not to modify otherwise bizarre looking code because of either regulatory, legal, or domain-specific reasons, and listing them in their native language makes more sense, since developers are not legal experts and translating them into English is not in the job description if the app is solely for domestic market.


In many places I worked at in Brazil people will just write everything, even variable names, in Portuguese. That came down to the developers not speaking English at all.

Who considers it gauche to have software written in non-English? The Russians you mean? I certainly find it perfectly acceptable and even desirable depending on the contributor's English fluency.


>Who considers it gauche to have software written in non-English?

2x BigCo's I've worked at, with 1-5K IT staff. Yandex is xbig though, with 5K+.

>That came down to the developers not speaking English at all.

That sounds implausible and not very sane, unless of course you're working for something very static like the government, and the tech stack you're using is set in stone.

Imagine that you're using a popular library and decide to update your dependencies, who's going to read the change logs & the documentation? Sr. Architect?

If you have a security issue and the only writeups are in English, who's going to be trusted to implement a fix correctly if the people can't understand what the problem is in the first place?


Other people in the industry know multiple languages and translate. Also, reading is much easier than writing. So, I'd say that, with maybe help of Google translate, a majority of people would be able to read those. Given I read and write English relatively well, I'd also help out with efforts whenever it was needed.

Anyway, you'd be surprised at how much documentation and resources are available in Portuguese. You can get by alright with it. I did it when I was younger and eager to learn how to code.

I've worked in big companies in Brazil where docs and comments were mostly in Portuguese. And some teams would even use Portuguese variable names.

Just because they do it differently in Russia, it doesn't mean every country is like that. Remember what you said about bubbles and all.

By the way, interestingly, I've briefly worked for the government there once and it was one of the places where English was most prevalent. People there were very open source oriented, which the Brazilian government incentivizes (at least used to back then). So, nearly all code had English comments and docs. Differently from closed source initiatives in the industry where other people reading it was considered a liability and not a goal.


In Russia the chairmen are interested in kickbacks which they won't get with OSS.

Also, it seems that the situation with education in general is still better than in Brazil, at least in some regards. In 2014 I worked in São Paulo and had a difficult time communicating with a taxi driver. I was trying to ask him to at least write the amount of money we had to pay for the ride on a slip of paper. Later a colleague explained that illiterate drivers are quite common.

In general Brazil felt mostly like home to me. Maybe Brazil is a version of Russia where homeless people don't freeze to death in the winter :)


It surprises me that you met an illiterate driver, maybe in the 80s/90s. But literacy rates in Brazil are around 99% nowadays and taxi drivers, especially in São Paulo, would know how to read and write. However, speaking English is not that common in Brazil. It's a big country and far away from the US and Europe. The neighbouring countries all speak Spanish, which is mutually intelligible with Portuguese. So, there's little pressure for people to actually learn English, unless they want to study/work abroad.


Portuguese is still Latin, isn't it? I suppose much of it is similar to English. Imagine if everything was, say, ελληνικά.


Yeah. I can see how that could be more difficult. Usually programmers will stick to ASCII in Portuguese, even though technically there are some non ASCII characters in the language. I can only imagine it's much more troublesome in Cyrillic. Anyway, my point was just that not writing comments in English is common practice in some corners of the world


I think it depends on what you're working on. I often find myself having to use legal terms for which there are no direct translations (because they do not exist in English-speaking countries), or there are analogous terms, but they do not mean exactly the same thing, or are very awkward to use (for example, a common abbreviation which everyone recognizes at once becomes something incomprehensible after translating it to English). So we can either use Russian for comments and git commits, or spend a bunch of time digging through specialized dictionaries and force everyone getting acquainted with the code base to do the same.


Not at all. People preferred to use English to feel some connection to the world. That was mostly psychological. A Silicon Valley cargo cult if you wish.

Also the company always avoided any use of external contractors.


People prefer to use English in software engineering contexts because most terminology is English anyway, so even if you're talking in Russian, you're still коммитишь чейнджсеты с багфиксами для релиза and so on. While most technical terms have nominal translations, they're usually so unwieldy that even people who are aware of them don't use them. But if they try to, others will have a hard time understanding them.

And if it's going to be half-English either way, might as well just do it all in English. If nothing else, it saves time switching keyboard layouts.


In the mid-00s I used to frequent a pub that was popular among expats and oil and IT industry professionals. I met a guy from the US there who was a professional translator and interpreter. He once told me that when he has to translate a business letter he finds Russian really cumbersome but when it comes to poetry he finds Russian so beautiful that English will never sound close to that.

Also he gave me his copy of Stephenson's Snow Crash. When I complained that it was so hard to read he answered that for him reading Lermontov is also a struggle. What a guy!


Highly inflective languages are generally better for poetry because you can make so many more things rhyme with some creative effort. Having a lot of similar sounds (like all the hard/soft consonant pairs) also helps.


And poetry is so much more than the rhymes: it could be that this person didn't really know English poetry as well as Russian poetry, or that he was more easily impressed by exoticism. I read French natively and have been living in English for 10 years and can now read Snow Crash-like stuff, and English always amazes me more than French, just because it can be so much richer than I'll ever produce myself, stuck to my French roots.

So if I read at random this in English: https://www.poetryfoundation.org/poems/43844/she-walks-in-be... and this in French: https://fr.wikisource.org/wiki/Les_Contemplations/R%C3%A9pon... , I can feel a mysterious beauty in English maybe I can't feel in a French poem, while the French one explodes in meaning, makes me think more than contemplate.

Rhymes are like a little funny exercise in poetry to add some music, but really far from the goal, or so I hope. I feel it's one of the many inputs: you have the rhymes, the rhythmic internal divisions of the stanza, the symmetry between them across the text, and you can add lots of little funny meta-meaning by reversing meaning in two halves of two successive stanzas, stuff like that. A poem is just a text that does more than transfer information at the first level, but does it on many more. English, Russian or French, they can all do it.


I think you're reading too much into this. It's easier to not switch languages in one task. If you're using an English-based programming language and English libraries/frameworks, you're not going to name your own variables in a different language. Comments most often follow that too. Any time I see someone attempt that, it's extremely distracting.


Not really. I've attended a meeting on the subject where people decided to use English "not to be losers" and I've heard similar stories from other teams and departments.


Interesting. I have never seen that attitude here in Sweden. If only Swedes are present we speak Swedish despite all our documentation, code and comments always being written in English. Could be due to all Swedes in tech (while most Swedes have good English it is not universal outside tech) being able to speak good English so there is no reason to show off.


And what do they understand by "not be losers" in that case? What are the main ideas behind it?


Something like "if someone can't read English he is a bad engineer, if you write comments for bad engineers, you either are a bad engineer or a loser".

I've heard things like "we should grow up and use English".

They wanted to be an "elite" in contrast with Russian-speaking "peasants".

The rational decision to use English was often motivated irrationally and was not about switching keyboard layouts. There were numerous discussions on the subject.

That is hard to analyze, I'm not a psychologist.


The OSS community that I grew up in speaks English and the software world is largely based on English. I really enjoy reading nicely written source, well commented in good English, with concise naming (again in English). And we're forced into a situation where programming languages are based on English. And I also enjoy even more writing not purely technical higher level documentation in Russian, expressing general ideas and approaches.

Russian developers with little or no experience in international environment often suck even at naming things in proper English, especially methods/functions. How many times have I seen grammar errors in APIs? But some of those guys are really great engineers who just aren't that much into spoken languages. Some of them aren't great writers even in Russian.

Yes, sometimes the source they write will certainly insult someone else's sense of beauty. But is there anything we can do about it? We can do something but still some people just aren't good at languages. I think we should just accept it and stop worrying. And those whose sense of beauty is being insulted should think why they choose to use their knowledge and abilities as a weapon against their teammates or as a decoration to create a false sense of superiority. Pretty sure many of them do this out of their inner insecurity or because of attachment (unhealthy perfectionism, insulted sense of beauty).

Also I think that projectional (structural) editors may change this situation or at least give us some insights on how natural languages affect the way we structure our software.


Programming languages aren't really based on English. I'd say they are farther from English than English is from Proto-Indo-European.


They use English keywords, English documentation, English reference to the standard library, until recently were mostly restricted to English alphabet, and their development is mostly discussed in English.


Right, English uses some Proto-Indo-European words too. And the Go designers complained that byte[] isn't English enough, so they did []byte, which is apparently perfectly English.


i've seen more comments in russian than in english. what are your numbers?

