Nice. This blog post is less focused on npm/node than previous ones. Does that imply a broadening of ecosystem support is intended? As I'm sure you're aware, competing companies support more language ecosystems. Hopefully this seed funding will allow for some degree of scope creep.
Were you able to resolve the potential issue with npm's fourth open source condition[1]? The addition of this condition seems to align with their acquisition of ^Lift Security[2], based on archive.org snapshots from before[3] and after[4]. Shifting away from npm exclusively seems like a reasonable way to hedge against this.
Yes, we intend to broaden the language support as soon as we can! This funding will definitely help get us there.
I'm not too worried about the npm condition. My reading of it is that it's intended to prohibit using security data generated by npm itself. When talking about "data about the security of Packages" they give the examples of "vulnerability reports, audit status reports, and supplementary security documentation". We don't use any of that stuff.
There seem to be some improvements in this area. Machine learning papers sometimes provide their data sets and other artifacts on websites like paperswithcode.com.
Have you ever contacted the authors to request their data? I personally have not.
That should continue to increase. Hopefully it will become academic misconduct not to release code; going further, it should be misconduct if the code cannot reproduce the paper's output with a one-line command.
Solving problems by studying tomes of knowledge is the job description of wizards/witches. Large improvements towards optimality, for some problems, are effectively locked away in some of these papers. As the article points out, there generally isn't much benefit in the context of building CRUD apps.
Some contexts have larger research communities than others. For example, there aren't nearly as many papers on real-time path planning in agent-mutable environments as there are for static environments. I assume this is because we still don't have Boston Dynamics robots in people's homes. If we could get the cost low enough, it may be more profitable to send mining robots to Mars than people, but I guess there are other applications as well.
I spent some months trying to find, understand, and implement the state-of-the-art algorithms in real-time path planning within mutable environments (Minecraft). I started with graph algorithms like A*[0] and their extensions. For my problem this was very slow. D* Lite[1] seemed like an improvement, but it has issues with updates near its root. Sample-based planners came next, such as RRT[2], RRT*, and many others.
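For reference, the basic A* loop is short enough to sketch here. This is a minimal grid version of my own for illustration; the 4-connected grid, unit edge costs, and Manhattan heuristic are my assumptions, not something taken from [0] or [1]:

```typescript
// Minimal A* on a 2D grid (4-connected, unit costs). Illustrative sketch only.
type Point = { x: number; y: number };
const key = (p: Point) => `${p.x},${p.y}`;
const manhattan = (a: Point, b: Point) => Math.abs(a.x - b.x) + Math.abs(a.y - b.y);

function aStar(start: Point, goal: Point, walkable: (p: Point) => boolean): Point[] | null {
  const open: Point[] = [start];
  const gScore = new Map<string, number>([[key(start), 0]]);
  const fScore = new Map<string, number>([[key(start), manhattan(start, goal)]]);
  const cameFrom = new Map<string, Point>();

  while (open.length > 0) {
    // Pop the node with the lowest f = g + h (a binary heap would be faster than sorting).
    open.sort((a, b) => (fScore.get(key(a)) ?? Infinity) - (fScore.get(key(b)) ?? Infinity));
    const current = open.shift()!;
    if (current.x === goal.x && current.y === goal.y) {
      // Reconstruct the path by walking the cameFrom chain backwards.
      const path = [current];
      let k = key(current);
      while (cameFrom.has(k)) {
        const prev = cameFrom.get(k)!;
        path.unshift(prev);
        k = key(prev);
      }
      return path;
    }
    const neighbors: Point[] = [
      { x: current.x + 1, y: current.y }, { x: current.x - 1, y: current.y },
      { x: current.x, y: current.y + 1 }, { x: current.x, y: current.y - 1 },
    ];
    for (const n of neighbors) {
      if (!walkable(n)) continue;
      const tentative = (gScore.get(key(current)) ?? Infinity) + 1;
      if (tentative < (gScore.get(key(n)) ?? Infinity)) {
        cameFrom.set(key(n), current);
        gScore.set(key(n), tentative);
        fScore.set(key(n), tentative + manhattan(n, goal));
        if (!open.some(p => p.x === n.x && p.y === n.y)) open.push(n);
      }
    }
  }
  return null; // no path found
}
```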
I built a virtual reality website to visualize and interact with the RRT* algorithm. I can release this if anyone is interested. I've found that many papers do a poor job of describing when their algorithms perform poorly. The best way I've found to understand an algorithm's behavior is to implement it, apply it to different problems, and visualize the execution over many problem instances. This is time consuming, but it yields the best understanding in my experience.
Sample-based planners have issues with formations like bug traps. For my use case this was a large issue. Moving over to Monte Carlo Tree Search (MCTS)[3] worked very well given the nature of decision making when moving through an environment the agent can change. The way it builds great plans from random attempts at path planning is still shocking.
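To give a feel for why random playouts add up to good plans, here is a rough sketch of the four MCTS phases with UCT selection. The `GameState` interface is a placeholder I invented for illustration; the real work for a Minecraft-style world is in the move generation and reward, not this loop:

```typescript
// Skeleton of MCTS: select, expand, rollout, backpropagate. Illustrative only.
interface GameState {
  legalActions(): string[];
  apply(action: string): GameState;
  isTerminal(): boolean;
  reward(): number; // e.g. 1 if the goal was reached, 0 otherwise (placeholder)
}

class TreeNode {
  children = new Map<string, TreeNode>();
  visits = 0;
  totalReward = 0;
  constructor(public state: GameState, public parent: TreeNode | null = null) {}

  // UCT: exploit high-value children but keep exploring rarely-visited ones.
  bestChild(c = Math.SQRT2): [string, TreeNode] {
    let best: [string, TreeNode] | null = null;
    let bestScore = -Infinity;
    for (const [action, child] of this.children) {
      const score = child.totalReward / child.visits +
        c * Math.sqrt(Math.log(this.visits) / child.visits);
      if (score > bestScore) { bestScore = score; best = [action, child]; }
    }
    return best!;
  }
}

function mcts(root: GameState, iterations: number): string {
  const rootNode = new TreeNode(root);
  for (let i = 0; i < iterations; i++) {
    // 1. Selection: descend while every legal action has already been tried.
    let node = rootNode;
    while (!node.state.isTerminal() && node.children.size > 0 &&
           node.children.size === node.state.legalActions().length) {
      node = node.bestChild()[1];
    }
    // 2. Expansion: add one untried child.
    if (!node.state.isTerminal() && node.state.legalActions().length > 0) {
      const untried = node.state.legalActions().filter(a => !node.children.has(a));
      const action = untried[Math.floor(Math.random() * untried.length)];
      const child = new TreeNode(node.state.apply(action), node);
      node.children.set(action, child);
      node = child;
    }
    // 3. Rollout: random playout from the new node, depth-limited.
    let state = node.state;
    for (let depth = 0; depth < 50 && !state.isTerminal(); depth++) {
      const actions = state.legalActions();
      if (actions.length === 0) break;
      state = state.apply(actions[Math.floor(Math.random() * actions.length)]);
    }
    // 4. Backpropagation: push the rollout reward up to the root.
    const reward = state.reward();
    for (let n: TreeNode | null = node; n; n = n.parent) {
      n.visits++;
      n.totalReward += reward;
    }
  }
  // Return the most-visited root action (exploration constant 0 = pure exploitation).
  return rootNode.bestChild(0)[0];
}
```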
Someone must incorporate these papers' best aspects into novel solutions. There exists an opportunity to extract value from the information differential between research and industry. For some reason many papers do not provide source code. A good open source implementation brings these improvements to a larger audience.
Some good resources I've found are websites like Semantic Scholar[4] and arXiv[5], along with survey papers such as the one for MCTS[3]. The latter half of this article is what gets me excited to build new things. I would encourage people to explore the vast landscape of problems to find one that interests them, then look into the research.
Maybe a bit tangential to the original comment, but have you also checked out some practical implementations of path planning in Minecraft, such as Baritone (https://github.com/cabaletta/baritone)? Practical as in, it is actually deployed widely to automate various kinds of complex tasks (building structures, automated mining, killing other human players) for bots on anarchy servers.
Although the method uses a variant of A* and might not be that "fancy" in academic terms, it's astonishing how much it can achieve (see demos like [1] and [2]), and it might actually be far more useful to study it closely instead of more theoretical papers.
This framework is very similar to an a priori, multi-objective optimization using linearly scalarized weights[0]. It is a priori because the weights are chosen before scoring and kept constant.
I've found this approach generally works well for humans; however, the results may not be Pareto-optimal.
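As a tiny sketch of what a priori linear scalarization looks like in practice (the objective names and weights below are invented for illustration, not taken from [0]):

```typescript
// A priori linear scalarization: fix the weights up front, score each option
// by the weighted sum of its objectives, then rank. Values here are made up.
type Option = { name: string; objectives: Record<string, number> };

// Chosen before scoring and kept constant -- that's what makes it "a priori".
const weights: Record<string, number> = { cost: -0.5, quality: 0.3, speed: 0.2 };

function score(option: Option): number {
  return Object.entries(weights)
    .reduce((sum, [k, w]) => sum + w * (option.objectives[k] ?? 0), 0);
}

const ranked = (options: Option[]) =>
  [...options].sort((a, b) => score(b) - score(a));
```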
There is both a large need for improvements in npm supply chain security and a market willing to pay for them.
Concerns:
1) npm Open-Source Terms condition four[1] states:
You may access and use data about the security of Packages, such as vulnerability reports, audit status reports, and supplementary security documentation, only for your own personal or internal business purposes. You may not provide others access to, copies of, or use of npm data about the security of Packages, directly or as part of other products or services.
This statement seems vague enough to potentially include your use case. It also seems to cover what Snyk, JFrog Xray, Sonatype, and WhiteSource do, so maybe this is not an issue.
2) It appears that this will be an open-core business. What capabilities are you willing to provide in the free/community edition and under which licenses?
3) The website doesn't show pricing. Can you provide details on this?
Questions:
1) What are your thoughts on using reproducible builds[2] plus Diverse Double-Compiling (DDC)[3] on the dependency graph to ensure build artifacts originate from known git repositories? Disclosure, I've been working on this for a few months now.
2) Where do you run your analysis? AWS and DigitalOcean have terms that prevent running high risk code.
3) Do you have examples of previous attacks and how your tooling would handle them?
Concern 1) I wasn't aware of this clause. Given how widespread the use of "npm data" is by the community I can't imagine they want to actually enforce this. But good to know.
2 and 3) We're still figuring out the business model, but here's our current plan: Package search and Package Health Scores are free for everyone to use through our website https://socket.dev.
Socket integrations, such as the GitHub App, are free for open source repositories forever. For private repositories, Socket is free while we're in beta, but we'll eventually charge something like ~$20/developer/month for private repos. We're still working out pricing but our #1 aim is to keep it affordable so everyone can get protected.
Question 1) I love this idea! This is something the team is already talking about. We want Socket to report reproducible builds and use them as a positive signal, as well as highlight them as a badge on the package page. For npm packages, lots of them probably already have reproducible builds that we can check by just running `npm install; npm build; npm pack`. I need to think more about DDC and how that would fit in. Perhaps we can chat about it sometime?
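To make that check concrete, a naive version could look something like the sketch below. The package name, repo URL, and tag are placeholders, and in practice tarballs often need normalized timestamps and file ordering before a byte-for-byte comparison is meaningful, so treat this as illustrative only:

```typescript
// Rough sketch: does the published tarball match a rebuild from git?
// "pkg", the repo URL, and the version tag are placeholders, not real values.
import { execSync } from "node:child_process";
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function sha512(file: string): string {
  return createHash("sha512").update(readFileSync(file)).digest("base64");
}

// 1. Rebuild from source at the tag that should correspond to the release.
execSync("git clone --depth 1 --branch v1.2.3 https://github.com/example/pkg.git", { stdio: "inherit" });
execSync("npm ci && npm run build && npm pack", { cwd: "pkg", stdio: "inherit" });

// 2. Fetch the published tarball's integrity hash from the registry.
const published = execSync("npm view pkg@1.2.3 dist.integrity").toString().trim(); // "sha512-..."

// 3. Compare against the tarball we just produced.
const rebuilt = "sha512-" + sha512("pkg/pkg-1.2.3.tgz");
console.log(rebuilt === published ? "reproducible" : "differs from registry");
```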
2) We're currently doing static analysis, so not actually running the code. Our dynamic analysis isn't ready yet so we'll cross that bridge when we get there.
3) All of the issues that Socket detects were picked with previous npm supply chain attacks in mind. You can see a list of packages npm removed for security reasons here: https://socket.dev/npm/category/removed
When you view any of these, we show the results of our security analysis. Here is a removed package I just picked at random to give you an idea:
Correct me if I'm wrong, but this shows that of the 5.06 billion visits in the last 90 days, 31.3% came from Windows while 1.1% came from GNU/Linux. If we assume that both groups visit government websites equally often, then for every GNU/Linux user there are (31.3/1.1) ≈ 28.5 Windows users. Scary stuff.
1.1% is way too high. I bet they didn't filter out all the scrapers that poll gov't websites.
Hmm I wonder if just parsing the HTML still works like it did 8 years ago when I had to scrape the USPS: https://github.com/NavinF/USPS-scraper/blob/master/USPS_scra...
As long as the USPS only allows API requests from browsers (as opposed to the much more common situation where you need to update the status of every tracking number in a database), people still have to scrape their website pretending to be a browser.
Oh they had an API 8 years ago too. It’s just that they only let you use that API from JavaScript running on your users’ browsers.
The undocumented tiny ratelimits and threat of bans for server-side API users (while no such ratelimits applied to the HTML pages) forced pretty much every app to scrape their HTML server side.
From my README which quotes their old docs: “Note: The United States Postal Service expressly prohibits the use of Web Tools "scripting" without prior approval. Web Tools scripting can be defined as a technique to generate large volumes of Web Tools XML request transactions that are database- or batch-driven under program control, instead of being driven by individual user requests from a web site or a client software package. The USPS reserves the right to suspend server access without notification by any offending party that does not have prior approval for Web Tools scripting. Registered Web Tools customers that believe they have a legitimate requirement for Web Tools scripting should contact the ICCC to request approval.”
For what it's worth, Linux is over-represented in the (fake) User-Agent strings of the bots that attack my web servers. Most probably are indeed on Linux, since they are predominantly scripts running on cloud providers. :)
That site shows 113.1 million visits in the last 90 days, with 47.2% from Windows and 1.2% from Linux. (47.2/1.2) ≈ 39.3 Windows users for every Linux user.
I agree that it is skewed in a number of ways. I just wanted to estimate a lower bound.
Thank you for pointing out a better source of data for my use case.
AWS CloudFormation had one of the worst user stories related to iteration time a few years ago.
The slow feedback loop between writing the YAML file and validating its structure/types/format was frustrating. I would pay good money for better tooling in the Infrastructure as Code (IaC) space.
Waiting for the resources to update, failing with cryptic error messages, then slowly rolling back only for the rollback to fail too. Once in that invalid state, manual resource creation was required before the rollback would succeed.
The AWS CDK has improved this significantly. As a result, the sun shines just a little bit brighter.
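Part of that improvement is simply that CDK stacks are typed code, so misspelled properties or wrong value types fail at `cdk synth` rather than mid-deployment. A minimal sketch, assuming CDK v2 (`aws-cdk-lib`); the stack and bucket names are just for illustration:

```typescript
// Minimal CDK v2 stack: properties are type-checked before anything is deployed.
import * as cdk from "aws-cdk-lib";
import { aws_s3 as s3 } from "aws-cdk-lib";
import { Construct } from "constructs";

class DemoStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new s3.Bucket(this, "LogsBucket", {
      versioned: true,
      removalPolicy: cdk.RemovalPolicy.DESTROY, // a misspelled enum won't compile
    });
  }
}

const app = new cdk.App();
new DemoStack(app, "DemoStack");
```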
[1] https://docs.npmjs.com/policies/open-source-terms#conditions
[2] https://blog.npmjs.org/post/172793182214/npm-acquires-lift-s...
[3] https://web.archive.org/web/20170926030855/https://docs.npmj...
[4] https://web.archive.org/web/20190207170526/https://www.npmjs...