This is very interesting work. I've been looking for something exactly like this to use on a large C application -- specifically to be able to annotate various API's as sources of different kinds of data, checks on how the data types are permitted to be used together, and operations that transform one kind to another. Compared to taint analysis we want to allow more categories than tainted/untainted, and transforming items between categories. Do you have any recommendations for similar tools that work with C?
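To illustrate the kind of scheme I have in mind, here's a toy sketch in Python (not any real tool's API; all names are made up): values carry a category label, declared transforms move values between categories, and sinks only accept certain categories.

```python
# Toy sketch of multi-category taint with transforms (hypothetical,
# not any real tool's API). Values carry a category label; transforms
# move values between categories; sinks check the category.
from dataclasses import dataclass


@dataclass
class Labeled:
    value: str
    category: str  # e.g. "user_input", "sql_escaped", "html_escaped"


def sql_escape(x: Labeled) -> Labeled:
    """A transform: moves a value from "user_input" to "sql_escaped"."""
    assert x.category == "user_input", f"unexpected category: {x.category}"
    return Labeled(x.value.replace("'", "''"), "sql_escaped")


def run_query(x: Labeled) -> None:
    """A sink that only accepts the "sql_escaped" category."""
    assert x.category == "sql_escaped", "untrusted data reached SQL sink"


# Allowed: user input is transformed before reaching the sink.
run_query(sql_escape(Labeled("o'brien", "user_input")))
```

In the real use case the checks would be static rather than runtime assertions, but the category/transform structure is the same.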
It seems one of the major downfalls is that the user has to define all sources and sinks. I might have missed it but how do you systematically define/find these? Personally was interested in a similar topic for a thesis and stumbled upon deepcode.ai which started out of ETH Zurich (https://files.sri.inf.ethz.ch/website/papers/scalable-taint-...). Are there any plans or reasons why you would not want such a system?
The article briefly mentions this, although it might not be super clear from the short description - "We regularly review issues reported through other avenues, such as our bug bounty program, to ensure that we correct any false negatives." We rely on these mechanisms to find places where we're missing taint coverage and write sources and sinks as necessary. As of right now, all the annotations are manual.
I hadn't looked too deeply into the literature there, the paper looks really interesting! We don't have any concrete plans to implement such a system, but I don't think there's any fundamental reason we wouldn't want automatic taint model generation. I'll give the paper a read on Monday to learn more :)
Not sure if you can answer this, but what are some classes of security bugs you can find with Pysa? I've only worked on smaller codebases so security I've dealt with is mostly AuthN/AuthZ.
Pysa can find any bug that you can model as a flow of data from one place to another. That includes standard web app bugs like SQLi, RCE, etc., and also some AuthN/AuthZ bugs, depending on how you do your checks. Concretely, this is the list of vulnerabilities Pysa is able to catch out of the box, without any customization:
https://github.com/facebook/pyre-check/blob/6975ff55fc59b7b9...
You can find most security issues with Pysa that you can model as a taint-flow problem. Examples include flows to functions that enable code execution or shell injection, SQL injection, SSRF, XSS, and many others. As long as you can model the security issue as a taint flow, Pysa should be able to detect it. This is the configuration we ship with Pysa, where you can find examples of the bug categories we detect: https://github.com/facebook/pyre-check/blob/master/stubs/tai...
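As a concrete illustration (a toy example I'm writing here, not Pysa output), this is the sort of flow a SQL-injection rule is meant to flag: user-controlled input reaching a query-construction sink without sanitization.

```python
# Toy example of the flow a SQLi taint rule targets: a user-controlled
# source reaching a SQL sink via string interpolation.
def build_query(user_input: str) -> str:
    # Interpolating untrusted input into SQL: the classic SQLi pattern.
    return f"SELECT * FROM users WHERE name = '{user_input}'"


# A crafted input escapes the quoting and changes the query's meaning.
print(build_query("alice' OR '1'='1"))
```

Pysa's job is to find the path from the source (e.g. a request parameter) to the sink (query execution) statically, without running the code.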
In addition to what Sinan said, we've had success running on fully untyped codebases. Here are some strategies you can use to get results on untyped codebases:
https://pyre-check.org/docs/pysa-coverage.html
Types will definitely make Pysa find more issues, but you don't need 100% coverage, or really more than the minimal coverage described in that doc I linked, to start finding some issues.
Pysa will try to analyze all functions regardless of whether they have type hints, but it works better when the function under consideration is typed. Namely, without type hints, it won't be able to pick up on tainted method calls or attribute accesses. However, regular function calls and standard data structures like dicts and lists should still be tracked normally.
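For example (a hypothetical sketch, not Pysa's actual stubs): without the parameter annotation below, an analyzer can't resolve what class `request` is, so it can't connect `request.GET` to an attribute that's modeled as tainted; with the annotation, the attribute access is unambiguous.

```python
# Sketch of why type hints help attribute-based taint tracking
# (hypothetical class and names, not Pysa's real stubs).
class Request:
    def __init__(self, params: dict) -> None:
        # Imagine this attribute were modeled as a taint source.
        self.GET = params


def handler(request: Request) -> str:
    # With the `Request` annotation above, analysis can resolve
    # `request.GET` to the modeled attribute; without it, it can't.
    return request.GET["q"]
```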
How fast does Pysa typically run? If I want to run it as part of my CI system, how much additional time might I expect it to add? Obviously this varies from code base to code base, but I'm curious what the experience at Instagram is like?
For Instagram (millions of LOC), the analysis gives feedback to engineers in about 65 minutes on average - note that this is in the context of a diff run: We compare the results of a run on the base revision to the proposed changes, running the tool once or twice depending on whether we hit the cache. It's hard to say how long it'll take on your repository as it depends on a lot of factors, but hopefully that provides some intuition.
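The diff-run idea is conceptually simple; here's a sketch of it (hypothetical, not Pysa's actual implementation): analyze the base revision and the proposed change, then report only the newly introduced issues.

```python
# Sketch of a diff run (not Pysa's real implementation): report only
# issues present in the head revision but not in the base revision.
def new_issues(base_results, head_results):
    return head_results - base_results


base = {"SQLi in views.py:10"}
head = {"SQLi in views.py:10", "RCE in tasks.py:42"}
print(sorted(new_issues(base, head)))
```

Caching the base-revision results is what lets the tool sometimes run the analysis only once per diff.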
1. Pysa should work without watchman - it shares some code and infrastructure with Pyre, but doesn't need Watchman to complete its analysis.
2. Hopefully the answer to (1) helps here. Pysa shares some code with Pyre, including the parallelization infrastructure: the same infrastructure that makes Pyre fast interactively makes Pysa fast on large codebases. Living in the Pyre GitHub repo allows Pysa to use the parallelization infra, in addition to the type checking APIs of Pyre, as necessary.
Pyre & Pysa try to do a best-effort analysis of Python 2, and support Python 2 style taint annotations, but most of the code we analyze at Facebook is Python 3.6+.
The reasons turn out to be fairly boring here. Zoncolan [0] and Pyre [1], with which Pysa shares core libraries, are also written in OCaml, and the language made sense from the perspective of both sharing code and having people on the project who are proficient and comfortable writing OCaml.
This seems like a good idea, and the more open source static analyzers the better. (It really tempts me to eventually pay for one of GitLab's higher tiers.)
Pysa is part of pyre-check, and from the documentation [0] it seems like a lot of work to set up; I hope it gets better.
I’m used to using safety [1] and bandit [2], and they are one-line drop-ins to my builds.
Pysa isn’t the same thing and seems much more powerful, but I hope they get to a “just give me something useful out of the box and I’ll customize my taint scans later” experience.