Simple but probably wrong solution: why not ban obfuscation libraries, compressed code, and self-loading code within the PyPI ecosystem? Any package that even refers to illegible non-source techniques gets flagged and blocked. The whole PyPI ecosystem seems undisciplined and could be tightened up. Why can't we make progress here?
You can pip install complex standalone executables, such as nodejs, and that is used across the entire ecosystem.
In fact, most packages are now wheels, which are not sources: they are compressed, and may contain binaries for compiled extensions, something extremely popular (the scientific and AI stacks exist only because of this).
Some packages need to be compiled after the fact, something that setup.py will trigger, and some even embed a fallback compiler, like some Cython-based packages.
Also, remember there are very few people working on PyPI, there is no moderation, and anybody can publish anything, so you would need a bulletproof automated heuristic. That's either impractical or too expensive.
If you want a secure package distribution platform, there are commercial ones, such as Anaconda. You get what you pay for.
Self-loading code is a huge part of the value-add of python libraries. Many of the popular libraries (e.g. Numpy and friends) trigger a bewildering chain of events to compile from source if not installing from pre-built wheels. And if you do have wheels, you have opaque binary blobs. So pick your poison: compile-on-install with possible backdoor or prebuilt .so/.dylib/.pyc with possible backdoor.
The most obvious (but not necessarily easiest) approach is to phase out setup.py and move everything to the declarative pyproject.toml approach. This is not just better for metadata (setup scripts make it really hard to statically infer what deps a lib has); it also allows for better control over what installers/toolchains run on install.
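For reference, a minimal sketch of the declarative form (the package name, version, and dependency below are made up):

```toml
# pyproject.toml: metadata and dependencies are plain data, so tools
# can read them without executing any package code.
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "example-lib"              # hypothetical package
version = "0.1.0"
dependencies = ["requests>=2.28"]
```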
Attackers still have quite a lot of latitude during the build phase, but at least libraries have the option to specify declaratively what permissions they need (and presumably the user has the option to forbid them).
Also, eval/exec are terrible and I wish there were a mode to disable their usage, but I don't know if the Python runtime has some deep dependency on them. Maybe there's a way to restrict it so that only low-level frames can call the eval opcode.
Would it be possible that the wheels could be built in a more-trusted / hardened environment? Having a binary blob isn't as serious when it comes from a trusted source. Almost all Debian/etc linux distributions have this feature (binary-downloading package manager).
The hardening could mitigate on-compilation hacking.
Obviously, this leaves "compile in the backdoor and wait for the user to fall into it", but at least that isn't an issue of compiling on the user's computer and it isn't an issue of binary blobs. And possibly there's a greater chance of detection if actual source code has to be available to compile.
>Also eval/exec are terrible and I wish there were a mode to disable their usage,
You can use audit hooks in the sys module (as long as you install them before any untrusted code runs) to disable eval/exec/process spawning, or even arbitrary imports or network requests.
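As a rough sketch of that idea (the event names are standard audit events in CPython 3.8+; the deny-list policy itself is just an example and has to be installed before any untrusted code runs):

```python
import sys

# Audited operations we refuse; the "exec" event covers both eval() and exec().
BLOCKED = {"exec", "subprocess.Popen", "os.system", "socket.connect"}

def deny_risky(event, args):
    if event in BLOCKED:
        raise RuntimeError(f"blocked audited operation: {event}")

# Audit hooks cannot be removed once added, so later code can't unhook this.
sys.addaudithook(deny_risky)

# eval("1 + 1")  # would now raise RuntimeError
```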
I’ve been building Packj [1] to flag PyPI/NPM/Ruby packages that contain suspicious decode+exec and other “risky” APIs using static analysis. It also uses strace-based dynamic analysis to monitor install-time filesystem/network activities. We have detected a bunch of malware with the tool.
The short answer is that this can't be easily mitigated at the package index level, at least not without massive breaking changes to the Python packaging ecosystem: PyPI would have to ban all setup.py based source distributions.
Even then, that only pushes the problem down a layer: you’re still fundamentally installing third party code, which can do whatever it pleases. The problem then becomes one of static analysis, for which precision is the major limitation (in effect, just continuing the cat-and-mouse game.)
Why would you think that would change a thing? Also, obfuscation has legitimate uses by people making stuff they don't want easily reversed. This isn't a Python-specific problem.
Yeah, just get rid of anything that has a binary blob. Cool. And then when PyPI gets swapped out for whatever immediately replaces it because PyPI is useless, then at least PyPI will be secure.
Yes, but PyPI has 4 million releases to check, and the scientific and machine learning wheels are very hard to compile (SciPy contains C, Fortran and assembly code, and must be compiled for macOS, Linux and Windows).
Providing a build environment for that would make it prohibitively complicated and expensive, and would basically mirror GitHub's CI.
That's the reason Continuum is making money: they sell a Python package distribution channel that is checked and locked down.
Most of the libraries I use include compiled C/C++/Fortran/Rust code. Pandas, scipy, scikit-learn, … if I were limited to pure-python libraries, I would probably rather swap languages, or at least package manager, at great inconvenience.
That being said, I don’t think PyPI would be «useless» - this was the state a few years ago, and we had to compile all the libraries ourselves. I don’t want to go back.
None of those packages are downloading and running CRAP.EXE within the setup.py process; that's not how native extensions work. It should be possible to flag packages that download things when setup.py runs, much less run exec within setup.py. A Python package that really needs you to run a Windows installer for its dependencies should have you do that separately.
Yes, but the problem here is the obfuscated loading of the malware code. There's no need to trigger it in the setup.py process: as long as it's in the lib, you can always put a call in a .pth file somewhere and run your malware as soon as any Python is executed.
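For context, a rough sketch of the mechanism being described: the site module executes any line beginning with "import" that it finds in a *.pth file under site-packages, at every interpreter startup. The file name and payload below are made up and benign:

```python
import pathlib
import site

# Drop a .pth file whose single "import ..." line is executed on every
# interpreter startup, long before the victim imports the package itself.
pth = pathlib.Path(site.getsitepackages()[0]) / "demo_hook.pth"
pth.write_text('import sys; sys.stderr.write("runs at every python startup\\n")\n')
```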
It should be possible to test packages for that also. If you are testing setup.py to see that no network access or exec occurs, you could similarly run the Python interpreter after install and ensure no network access / exec() happens at that point either, assuming one has not imported the package. Or just disallow unfamiliar .pth files from being installed altogether (outside of those generated by setuptools etc. for normal execution).
Given that every attempt to sandbox Python has failed, and that every system exposed to the public has been pwned, I assume this is a cat-and-mouse game we can't win.
At best I suppose we could put in place checks to catch the low-hanging fruit. But we are, after all, allowing a Turing-complete and highly dynamic language to execute.
> or just disallow unfamiliar .pth files from being installed altogether
That would kill the entire plugin ecosystem.
Now, the next thing could be to have a permission system, requesting access to the network, fs, .pth, etc. It would not be a bad idea, given that we are, after all, installing things that are as powerful as apps.
But it would be a gigantic effort, and users still would just accept without reading, like they do with apps.
Sure, I didn't intend to claim that. It's just a hassle for me to compile my own C code, which I'd have to do if binaries weren't bundled. That's why Anaconda Python took off on Windows - it's hard work to compile SciPy on Windows!
PyPI delivers wheel files for pre-built binaries, and that's the only way one is supposed to distribute pre-built binary executables or shared libraries. The issue of "runs malicious code in setup.py" does not apply in that case because setup.py isn't invoked.
The only solution I've ever seen to that requires investing trust in
an "authority" which then becomes corrupt and censorial. One simply
expands the dilemma to a triad: security/freedom/convenience.
If I am not mistaken, the PyPI "Cheese Shop" is owned by the Python Software Foundation, a 501(c)(3) nonprofit organisation which constitutionally values Software Freedom highly. It seems natural that convenience would be sacrificed if security is a concern.
Such an authority in the Linux world used to be a distribution. Installing a binary blob provided by Debian build servers is based on decades of trust.
But there is a tradeoff between having things thoroughly vetted and tested, and moving fast.
Who can build me a UI with four sliders that selects the packages I
can install? Bonus: when I move a slider it highlights all the
potential packages that changed status with reasons why they are now
included/excluded.
You're right, the prototype GUI is a weekend of work. But you also know that's not where the work is :) Now that some more intelligent comments are coming in, we can talk about the analysis and tagging of thousands of packages, dealing with backward compatibility, and what happens when naughty malware just hops to another level of trust.
But none of that is a call to give up. We just need to think seriously
about the problem we face.
Windows S Mode restricts PyPI to pure Python for me due to Device Guard. I'm happy to leave it on ($250 laptop). Indeed, NumPy has been a recurring blocker, maybe 3 times now. But general peace of mind is the only way I've known Python/PyPI, so I'm pretty happy with it. I also have a few RasPis that I can use as auxiliary devices, a hardware sandbox, which I think is a pretty cool tradeoff. I haven't gone there yet, beyond configuring SSH/xRDP so I'm ready if the day comes.
But I've made a ton of web apps and tools anyway, including a little process launcher that plays the role of poor man's Docker.
It'd be nice if those popular systems had a pure Python capability anyway, a similar analogy being software-rendered 3D back in the day.
A malicious author could embed malicious code in the package and still get the package signed. Hashing won't prevent this sort of thing on PyPI; it only addresses in-transit and alternate-supplier attacks.
Requiring anything from open source authors is a losing proposition. Items of interest just won't end up on PyPI. IIRC this chain of events already happened on another distribution platform.
One of the underappreciated benefits of Richard Stallman getting what he wants would be that antivirus programs could then be updated to flag on all obfuscated code or anti-debugging actions.
Those things you named are just some of the checks it made. The Python part of it was also an encoded bzip file, which offers a bit of a debugging headache; then it downloads a .pyc file which was run through an obfuscator, which is more of a Python headache. Your "in fact" is not a fact.
The methods this malware uses for anti-debugging wouldn't cause a headache for anyone who isn't completely new to the subject. Download 10 random Python malware samples and you'll notice that probably at least 8 of them follow this exact same packing and execution pattern. The Discord hook and laughable end payload are a good indication that whoever wrote this is probably some high school kid.
The only surprising thing about this article is the claim that this type of malware hasn't been spotted on PyPI before. That would suggest that there aren't many credible actors trying to spread through PyPI at all.
Huh. It never ceases to amaze me when another demonstration is presented to me that "plus ça change, plus c'est la même chose" (the more things change, the more they stay the same) in this industry. I suppose it is only to be expected that some of the old anti-piracy techniques found in 8-bit floppy- and cassette-distributed software might eventually find new philosophically similar implementations in malware.
Some of that self-modifying and anti-defeat code back then was truly a work of art, squeezed into mind-bogglingly small memory and CPU footprints, and the malware authors will have a field day re-implementing its future cousins in spirit, and some of the greybeards amongst the white hats will get to relive their 8-bit glory days hunting and defeating them.
The article gave a description of a really super primitive technique compared to the last generation of those anti-piracy techniques, but I still see a family resemblance.
The more I hear this stuff, the more I write things in Go with no external dependencies pulled in. I can do 95% of what I need to do without involving a supply chain or downloading anything random off the internet other than the Go distribution itself.
I like the sentiment and I'm usually first in line to ridicule the 'npm install left-pad' crowd, but this doesn't always fly. Python is a great glue language to mash high performance C/fortran components together. One does not simply write sklearn or pytorch from scratch.
"Python is a great glue language to mash high performance C"
This is exactly what I'm starting to work through. After 6 years of Python, I've finally hit the limit of what I can do with it. Now I'm working to rebuild an algorithm in C to reconnect to the Python application.
"One does not simply write sklearn or pytorch from scratch."
I also agree with this. Would either be in a product though? Personally, if it's not a product, I wouldn't mind dependencies.
Yes, they are in at least one product I can think of, and likely more. That product deploys its own conda environment and includes a huge amount of spatial analytical tools.
Governments and large private enterprise the world over use ArcGIS Pro, as do many NGOs and education institutions, which is a massive leap forward for both desktop and highly integrated Web GIS work.
I'd be prepared to bet a bit of blind money that other industry tools use a similar setup, where the Python libraries permit an exceptional cadence of development and help place those vendors' products at the pointy end of the market.
How they manage dependency security isn't super clear. They're always a couple of versions behind, so perhaps it's a CI/CD QA/QC thing which also includes security.
I get the general idea, but at the same time, I don't have the time to write my own libraries from scratch - all modern web standards are complex and most libraries filled with years to decades worth of experience of all the edge cases that crop up, particularly as most standards don't carry a "compliance test suite".
It's one thing if I were paid by my employer to re-invent the wheel, but for personal projects... I don't have that much free time for them in the first place any more, I want to get shit done and not shave yaks all day. When I want a good grind, I'll pack out Factorio or one of the LEGO Switch games...
There's a difference in values between those who reinvent the wheel and those who leverage opensource. It sounds like you value time-to-product whereas I value ownership of said product.
There are always risks associated with building on other people's land, platforms, and codebases. However, there are also risks when reinventing the wheel. Both perspectives have advantages, disadvantages, and use cases.
A compromise is to audit and then pin exact versions, or even copy and paste the code into your project. Yes, this is a clear tradeoff in that you'll lose access to newer updates, but it's certainly worth thinking about. I do it with relatively trivial libraries for things that I know the package has solved various edge cases, is small in scope, and probably won't be updated again, for example.
I always build my whole computer from scratch, from NAND gates all the way up to the full OS; I build my own switches and cut the network cables myself, dependencies be damned. /s
For Python at least, most of the dependencies are very justifiable. The Python stdlib is huge and satisfies most regular programs, such as glue code. But for web and ML it is not possible to include those libraries in the stdlib, nor is it feasible to write them from scratch.
Let's say you are writing an API that works with some particular scientific file types on the back end, and you want to load that data into memory for fast querying and returns. Now, that data is a multidimensional time series for each file. You could spend the next months writing libraries and bashing your head against the wall, or you could leverage the 30+ years of development in that stack that enables you to read these.
Xarray to read, numba for calcs in xarray, pandas to leave it sitting in a dataframe, numpy as pandas' preferred math provider. You could write the API componentry from there, sure. Or you could use a library that has had the pants tested off it and covered most of the bugs you are likely to accidentally create along the way.
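A toy sketch of that glue pattern (the file name and variable are made up; assumes xarray and pandas are installed):

```python
import xarray as xr

# Open a multidimensional time series (NetCDF) and hand the API layer a
# flat table instead of months of hand-rolled file parsing.
ds = xr.open_dataset("measurements.nc")              # hypothetical file
df = ds["temperature"].to_dataframe().reset_index()  # hypothetical variable
latest = df[df["time"] == df["time"].max()]          # e.g. serve only the newest slice
print(latest.head())
```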
There's no compelling reason to write everything from scratch. If everyone was taking that approach then there would be no reason to have an ecosystem of libraries, and development would grind to a halt because we, as a collective of people programming, are not being efficient.
I see no compelling reason to implement a multidimensional time series for multiple files as a component of any backend API that consumes user (defined) data.
In what circumstance could that be profitable? Even if you batched data, any number of concurrent users would gobble resources at an incredible rate.
Who said anything about profit? Not everything that exists to be solved, and for which there is demand, is driven by profit. Think: regulation, environmental, NGO, citizen science, academia, government agency, public service. All places where systems can exist that are not for profit, but do grant significant capabilities to their user base.
Also, it's a particularly arrogant point of view to assume that because you cannot see a reason for something to exist that its development is invalid both now and into the future. You've also assumed the data is user defined.
I can also guarantee you that user concurrency is not an issue after some recent load testing, with load capabilities surpassing expected user requests by several orders of magnitude whilst on minimum hardware.
I probably should have said economically viable. Handling and manipulating data like that is intensive and thus expensive. If it's not user provided data, why manipulate data with that approach?
Maybe it is arrogant. That entirely depends on whether or not a product or service uses this specific approach -- successfully. Do you have an example?
Edit: I also want to clarify that my comment doesn't suggest that the underlying technology is bad or without use cases; only that it isn't suited for remote (online) processing. It would be way cheaper to manipulate data like that locally.
That's the point. It's not user data, and the data cannot be manipulated on the user side without excessive hardware, software, and troubleshooting skills.
Taking that scientific data and making it available in report format for those which need it that way, when the underlying data changes at a minimum once per day, is the more important aspect.
The API is currently returning queries in about 0.1 to 0.2s. They are handled async right the way through. It's fast, efficient, and the end result whilst very early in the piece is looking nice. Early user engagement has been overwhelmingly positive.
It's not a public endpoint, and the api is still under dev with interface largely yet to start. So, can't share / won't share. Sorry.
Where it will be shared is among those with an interest in the specific space. That includes government agencies, land managers, consultancies etc. At no cost to them, because what the outputs can help offset in terms of environmental cost dwarfs the dev cost.
Ceres Imaging (Aerial and satellite imagery analytics for farming), Convoy (Trucking load auctions), etc. There are plenty of companies doing very real work that need this kind of heavy numeric lifting.
Very cool examples. Thank you for sharing. I'm going to read into them. I'm not familiar with any web companies using this technology so it'll be interesting to dig in.
Flask seems to be a very stable and feature-complete framework (I see about 3 commits per year for the last few years).
At this point isn't it easier and just as safe to manually review the code, pin the hash in a lockfile, and manually review the rare changes than it is to rewrite everything?
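For the pinning part, a minimal sketch of pip's hash-checking mode (the package, version, and digest are placeholders, and in this mode every transitive dependency must be pinned and hashed too):

```text
# requirements.txt
flask==2.3.3 \
    --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000

# Install with hash checking enforced:
#   pip install --require-hashes -r requirements.txt
```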
Can someone explain why this comment is getting downvoted? I believe the statement is accurate. I'm not looking to justify or debate my position, but a clear answer might help me better approach this topic in the future.
Standards and requirements will change, bits will rot, and I'm not expecting any ecosystem to keep up with coming and going demands.
A better solution IMHO would be project-level capabilities, so you can pull in a dependency but restrict its lib/syscall access, and it would fail to build when it turns malicious.
Maybe it will solve at least something, maybe some day.
Agree. I'd like to see an OpenBSD pledge(2) type system for libraries. So you can mask individual library capabilities rather than just programs. I don't want a web server that can write to the file system and I don't want a CSV reader that can talk to the network.
Doing this kind of thing at the library level is generally not very useful, because security protections between things running in the same process are hard to make very strong.
This is a limitation of the particular language/ecosystem though; it's feasible in a new language that has this security baked into the language primitives.
I don't think the Go stdlib is significantly better than the Python batteries. For normal stuff, you can build without dependencies in Python too. The problem starts when you use more complex stuff, or want to save time by using a lib that delivers certain benefits. After all, you can't build and maintain everything by yourself.
I wonder if there is room for a security model based around "escrow builds".
Imagine if PyPI could take pure source code and run a standardized wheel build for you. That pipeline would include running security linters on the source. Then you can install the escrow version of the artifact instead of the one produced by the project maintainers.
You can even have a capability model - most installers should not need to run onbuild/oninstall hooks. So by default don’t grant that.
This sidesteps a bunch of supply-chain attacks. The cost is that there is some labor required to maintain these escrow pipelines.
With modern build tools I think this might not be unworkable, particularly given that small libraries would be incentivized to adopt standardized structures if it means they get the “green padlock” equivalent.
Libraries that genuinely have special needs like numpy could always go outside this system, and have a “be careful where you install this package from” warning. But most libraries simply have no need for the machinery being exploited here.
What does it mean for a package to have been signed with the key granted to the CI build server?
Does a Release Manager (or primary maintainer) again sign what the build farm produced once? What sort of consensus on PR approval and build output justifies use of the build artifact signing key granted to a CI build server?
> Pushing to regro-cf-autotick-bot branch¶ When a new version of a package is released on PyPI/CRAN/.., we have a bot that automatically creates version updates for the feedstock. In most cases you can simply merge this PR and it should include all changes. When certain things have changed upstream, e.g. the dependencies, you will still have to do changes to the created PR. As feedstock maintainer, you don’t have to create a new PR for that but can simply push to the branch the bot created. There are two alternatives […]
nektos/act is one way to run a github-actions.yml build definition locally, without a CI service (e.g. GitLab Runner, which requires roughly --privileged access to the Docker/Podman socket), to check whether you get the exact same build artifacts as the CI build farm:
https://github.com/nektos/act
A Multi-stage Dockerfile has multiple FROM instructions: you can build 1) a container for running the build which has build essentials like a compiler (GCC, LLVM) and packaging tools and keys; and 2) COPY the build artifact (probably one or more signed software packages) --from the build stage container to a container which appropriately lacks a compiler for production.
https://www.google.com/search?q=multi+stage+Dockerfile
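A rough sketch of that two-stage pattern, assuming a project that builds a wheel (the image tags, paths, and use of `python -m build` are illustrative, not prescriptive):

```dockerfile
# Build stage: compiler toolchain and packaging tools live only here.
FROM python:3.11 AS build
RUN pip install --no-cache-dir build
COPY . /src
WORKDIR /src
RUN python -m build --wheel --outdir /dist

# Runtime stage: no compiler, no packaging toolchain, just the artifact.
FROM python:3.11-slim
COPY --from=build /dist/*.whl /tmp/
RUN pip install --no-cache-dir /tmp/*.whl
```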
Are there guidelines for excluding entropy like the commit hash and build time, so that the artifact hashes are exactly the same and are reproducible on my machine, too?
>Libraries that genuinely have special needs like numpy could always go outside this system, and have a “be careful where you install this package from” warning. But most libraries simply have no need for the machinery being exploited here.
My personal experience with any situation where I need to get some crusty random Python library to run has always been a lot of "-y"ing, swearing, and sketchy conda repositories. Usually it's code that was written years ago and does some very particular algorithm that's essential, so any warnings in the pipeline basically get ignored given the sheer difficulty of the task.
Apologies for the naive or off-topic question. I'm still a relatively new hobby Pythoner, and no formal training in CS.
I clearly get the security risks associated with random libs available for Python. Is this also the case for other languages like Java? Are the dependencies available to them also a relative free-for-all, or are bugs mostly accidental?
I think there is always a danger, in every language, when you install a 3rd party dependency from a package repository. But usually this is restricted to the runtime of the application that uses the 3rd party library (and maybe, depending on the language, the code paths that are executed).
That's a difficult enough problem to deal with already, but with Python it's possible to execute code at install time of such a 3rd party library (basically, when you do a 'pip install stuff'). So you might never have run the application you installed, but you'd still have executed whatever malware was hiding. This is not the case for a lot of other languages. Also, Python allows the execution of code when you have an `import stuff` statement, which is often not the case in other languages; that's not directly related to this, just another Python-specific attack vector.
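To make the install-time part concrete, a hedged, benign illustration of why a source install runs arbitrary code (the package name is made up):

```python
# setup.py
# Everything at module level runs when pip builds this source distribution,
# i.e. during `pip install`, before the package is ever imported or used.
# A malicious author could fetch and exec a payload right here; this only prints.
print("this line runs at install time, not at runtime")

from setuptools import setup

setup(name="example-pkg", version="0.1.0")
```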
All of these libraries are completely secure as eval/exec are used with code fragments that are generated by the libraries, not based on untrusted input.
eval() /exec() are not running executable files, just Python code, the same way all the rest of the package is already doing.
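For example, here is a hedged sketch of the pattern being defended: the string passed to exec() is assembled entirely from inputs the library controls, which mirrors how the stdlib dataclasses module generates __init__ (the helper below is illustrative, not the stdlib code):

```python
def make_init(field_names):
    # Build the source for __init__ from names the library itself supplies,
    # then exec() it; no untrusted input ever reaches the code string.
    src = "def __init__(self, " + ", ".join(field_names) + "):\n"
    src += "".join(f"    self.{name} = {name}\n" for name in field_names)
    namespace = {}
    exec(src, {}, namespace)
    return namespace["__init__"]

class Point:
    pass

Point.__init__ = make_init(["x", "y"])
p = Point(1, 2)
print(p.x, p.y)  # -> 1 2
```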
Please support your assertion. I would also recommend opening CVEs detailing your discovered attack vectors, especially that of Python dataclasses in the standard library, which are in very widespread use. If you do in fact have some insight on how Python dataclasses are an "exploit waiting to happen", I think it's irresponsible to just sit on that information.
If you run a security linter like ‘bandit’ you’ll get warnings for eval and other security holes.
It seems you can’t run bandit on deps, but perhaps if you fork them and build yourself?
If you are security conscious, having a rule that you can only install from a local pypi with packages you have forked would be a more defensible perimeter. But, a maintenance pain for sure.
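As a minimal sketch of that workflow (the paths are made up):

```text
pip install bandit
bandit -r vendored/somelib/    # recursively scan a forked/vendored dependency
```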
> Malware that is more stealth-conscious would just stop running without any indication, instead of interacting with external processes.
I always wondered if we could just use this against the malware. E.g. just run a useless process which is named/looks like a debugger and the malware stops itself. Of course that's nothing to be relied on on its own but maybe as an additional layer of defense?