Hacker News
Launch HN: Bedrock AI (YC S21) – Using ML to identify red flags in SEC filings
183 points by kbennatti 14 days ago | 107 comments
We’re Kris, Suhas, and Heather (YC S21) and we’re building Bedrock AI (https://bedrock-ai.com/). We use machine learning to extract hard-to-find information and assess risk in public company reports (SEC filings). Our platform is used by investors to improve portfolio returns and mitigate downside risks.

Most public company data is unstructured and textual. Because relevant information is hard to find, a lot of corporate data is radically underused, to the detriment of investors. For example, our research shows it can take 12-18 months for corporate malfeasance to be incorporated into stock price after clear warning signs appear in financial text. Hard-to-find information that we extract includes accounting and governance choices, product defects, regulatory issues, customer/market reliance and much more.

One example is Sino-Forest, Canada's Enron. Sino-Forest was a darling of Canadian investors until an infamous exposé, by short-seller Muddy Waters, in 2011. It turned out it was a forestry company that didn’t actually own any forests. Months before the exposé and crash, there were obvious red flags in the company’s disclosures including buying and selling from companies controlled by their directors and problems with the review of their bookkeeping! Our algorithms picked up these red flags and more, and assessed Sino-Forest as high risk when we ran our models on the company’s historical filings.

I’m a CPA and a developer (odd combo). The tech community has largely ignored public company financial disclosure. A few years ago, I published a basic piece using computational methods to analyze cannabis disclosure. The local regulatory agency contacted me to give them a workshop on text analytics. It was then that it hit home how little was being done in the field.

Information drives financial markets. The difficulty of assessing risks hidden in long public filings makes earnings manipulation, and even fraud, both possible and profitable. Earnings manipulation involves using the flexibility in accounting standards to make financial statements look better than reality. This is easier than most people realize because accounting involves MANY choices and estimates.

There is money to be made by accessing and trading on underused predictive signals. Making money by stopping fraud is a win-win situation.

There are two main technical challenges thwarting progress in the field: (1) NLP models work best on short (500-character) text, but financial filings are hundreds of pages long, and (2) important and unimportant language sounds very similar in financial text. For instance, this sentence sounds like it could be indicative of terrible things going on behind the scenes but is in fact, just boilerplate disclosure: “We face risks and uncertainties related to litigation, regulatory actions and government investigations and inquiries.” You can see how ML models easily get confused.
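
A minimal sketch of one common workaround for the length problem, assuming a naive regex sentence splitter and a 500-character budget: pack whole sentences into short chunks that a model can score one at a time. The function names are hypothetical, and this is not a description of Bedrock AI's pipeline.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A real filing would need a sturdier splitter (abbreviations such as "Inc." and "No." break this one), but the packing logic stays the same.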

There’s a big gap in both academia and industry. A lot of effort is being put into forcing results from non-existent linguistic signals. Models that claim to predict specific outcomes often don’t hold up to scrutiny in practice.

To overcome the technical challenges, we used supervised and semi-supervised learning with high-quality labels, focused on tangible facts represented in textual context, and adapted language models using domain expertise.

As far as we know, no other solution is able to identify problematic/risky disclosure algorithmically. Using search terms to do something similar results in overwhelming noise. The disclosure selected by our algorithms is highly predictive of downside risk - validated in deployment and also in backtesting.

We launched our core product in April 2021 (see https://bedrock-ai.com) and it’s used by hedge funds and institutional investors. We’re also doing a pilot to support Canadian securities regulators (https://bit.ly/3wOwOj6). We’ve also just launched a minimalist free site, Ledge (https://ledge.bedrock-ai.com), to help retail investors stay up to date on material events at companies they follow. Companies are required to disclose material events between their quarterly reports, but these disclosures rarely make the news.

Our core/premium product is currently only available to institutions, in part because retail investors generally don’t prioritize risk management and therefore aren’t committed customers. We plan to expand the free site and better support individual investors as we grow.

We would love to hear from you. Have you tried to read annual reports and gotten lost in the weeds? What has your experience been in making NLP models work on financial text?




> Our algorithms picked up these red flags and more, and assessed Sino-Forest as high risk when we ran our models on the company’s historical filings.

When you're backtesting your models, how do you distinguish between novel fraud that the industry is _now_ aware of vs. fraud that was visible but ignored -- if your model has learned _from_ Sino-Forest, how do you know it would have caught Sino-Forest at the time?

> For instance, this sentence sounds like it could be indicative of terrible things going on behind the scenes but is in fact, just boilerplate disclosure: “We face risks and uncertainties related to litigation, regulatory actions and government investigations and inquiries.” You can see how ML models easily get confused.

Humans too, you're describing at least one comment in every HN thread about an SEC filing :D


Great q. Sino-Forest is an out-of-sample test so our models didn't technically "learn" from it. That said, very valid comment. Historical testing only goes so far. Assessing whether our algorithms work in deployment has been cool. Check out some of our live, in deployment examples here - https://bedrock.substack.com/p/bedrock-ai-vs-activist-shorts


>https://bedrock.substack.com/p/bedrock-ai-vs-activist-shorts

How often are companies rated with a risk factor this high? As in, does a risk factor in the 80s mean that fraud is extremely likely, or is it just notifying humans that this filing might be worth reading over with a fine-toothed comb?


Fewer than 10 percent of companies have a risk score above 80. Our historical testing shows that around one third of companies with scores above 80 turn out to be fraudulent. (This is hard to test, so it might be an overestimate.)


Just about every company mentioned in the article increased in value near and/or after the time of those reports. Pretty interesting.

Great work!


> Humans too, you're describing at least one comment in every HN thread about an SEC filing :D

LOL. Yup!


Congratulations! I played with that database in 2017 when they opened up the text. I'm so happy to see it's finally been picked up.

I built a thing which recreated the income statements and then flagged for non-conformance. I found Babcock & Wilcox Enterprises "Good will impairment charges" as an anomalous line in their Nov 8, 2017 filing, just prior to when their CEO resigned in 2018 and they had to make so many adjustments. Unfortunately, I'm just a data geek. I did have a financial advisor friend who was VERY interested, but we couldn't get enough interest to validate development, so I moved on to other projects.

I'm no finance whiz, but that database is a trove. So happy to see a company develop around giving it a go!


Thanks for the wishes. That's really cool that you were able to do that!


Seconding Suhas here. Super rad


Which database?


I assume they are talking about EDGAR - https://www.sec.gov/edgar/search-and-access


I've worked with 10-Ks and 8-Ks extensively for the purposes of using them for NLP. This is extremely arduous work and a clear winner in terms of profitable ideas, so kudos to the team for the launch. This is really impressive.

Perhaps this is giving a bit too much away in terms of the secret sauce, but would love if you could talk a bit about how you handle the wild disparities in the structure of the documents. Do you parse the XBRL?


Thanks for the kind words! We don't use XBRL at all. We did try it initially, but it was wildly inconsistent across companies. I think one of the things that worked well for us was that we spent a lot of time on the initial stages of the pipeline (efficient sentence and word tokenization, span detection), which boded well for our models later on.


Thanks! This is similar to where I ended up landing as well. It turns out using a non-standardized standard format is practically worse than dealing with giant blobs of plain text!


So true


> For example, our research shows it can take 12-18 months for corporate malfeasance to be incorporated into stock price after clear warning signs appear in financial text.

One thing I noticed about "Efficient Market Theory" is that it is unfalsifiable. It isn't a scientific theory and it also cannot be proved true or false, only useful when convenient. It relies on magic, rationality, and the assumption of omniscience by large investment banks.

Nothing is priced in.


I love that. I agree until the "nothing" bit. Some things do get priced in. You can see specific stocks "react" to news, adjusted for market returns etc. ESG factors do appear to be getting "priced in" as well


Congrats on getting out into this space. I suspect you will be very successful.

Have you considered training a model to use the quarterly reports to identify companies likely to see a significant upside or downside over the next N months? Seems like it would be a very straightforward task and could pick out some real diamonds from the rough. Not sure how well it would work though... Could be a very simple way to chase returns.


Have you read Eliot Peper's novel Uncommon Stock? It is about a startup doing automated analysis of accounting information. "I'm a CPA and a developer" immediately reminded me of the book.

I must say the real thing is so much cooler than the story, even if the story itself is also very cool.


Hey guys, congrats on the Launch. Is there any API to get the metrics you guys calculate on SEC Filings?

I recently launched https://quantale.io which is a web-based Bloomberg Terminal alternative and we monitor SEC filings in real-time to show them to users[1].

It would be great to show additional data related to the SEC filings if there was an API.

[1] https://quantale.io/dashboard/sec-filings


Sweet! The world needs a Bloomberg alternative so props. We don't have an API atm. We're focused on supporting human analysts. The meat of the product is the red flags which are textual and not quantitative.


If you think you could use more data such as news headlines, reddit and twitter chatter, and forum discussions about the companies from the filings then let me know, Quantale can provide that. The additional data could augment the current fraud detection.


Interesting. Potentially? How do we get in touch?


email me at vikash@quantale.io

thank you


How are you different than the native macOS WeBull application? WeBull comes the closest I have seen to a Bloomberg terminal.


WeBull only provides financial and news media data about the stocks, but Quantale is the first of its kind to provide not only financial and news media data but also real-time data from the SEC, Reddit, and Twitter.

Along with that, additional features include:

- Top trending stocks from the internet in real-time

- Sentiment of the discussions using a PyTorch model

- Ability to save posts (Reddit, Twitter, news headlines, SEC filings)

- Ability to create a watchlist to watch and monitor a group of tickers like SPACs

Features in Roadmap:

- Alerts on change in the activity of stocks

- Level 2 Data

- Options Data

- Brokerage so that users can trade directly from Quantale

Please try it out at https://quantale.io and any feedback is appreciated. Feel free to reach me at vikash@quantale.io


This isn’t really true… there are a myriad of web tools that provide similar functionality, and most are free, such as:

- https://gamestonkterminal.netlify.app/

- https://gbear.trade

- https://app.hypeequity.com/monitor


Just curious, are you doing anything to prevent companies from running drafts of their disclosures through your algorithm? That would allow them to tweak the disclosure until your AI gives it a positive score.

We don't sell to filers. Ever. Got to protect against conflicts of interest. That's one of the reasons we don't have a true free version of the product.

That said, we're using language models, so replacing words with synonyms won't evade the model as long as it's expressing the same thing.


That was my first question too. Sounds like their focus on tangible facts (many of which I presume could come from other documents) also helps mitigate the risk of being gamed.

Out of interest - why didn't you start a hedge fund instead?


I think a big part of a strategy like this is long term. If you want to short a stock you need to make sure the fraud is discovered by other investors. So you probably need to have a strong marketing department that basically trashes other companies, and this may not be a goal a lot of people aspire to. With a short you're also exposed to potentially unlimited losses.

And finally "markets can stay irrational longer than you can stay solvent." -Keynes


My guess is they also have plans to move in on other sorts of fraud detection, for example bond issues or privately held firms. That’s my good faith interpretation.

My bad faith interpretation is they don’t trust their algos enough to trade on it themselves.


I guess I'm not even saying it's bad/good faith. I just literally wonder - why not do the obvious thing? A number of years ago I ran into several people trying to sell services predicting stock movements via twitter sentiment. Same thing - if it works just make billions fast using it to take positions. It's not trivially easy to start a hedge fund, but it's not rocket science either. The key ingredient is finding a way to make good returns. These twitter people seemed to have good intent.. it was just strange.

See response above. And not just fraud detection. We plan to expand to other areas including credit/bankruptcy, ESG and tracking macro changes and trends

Right, and you can make money in most cases (and still publish). Buy credit default swaps for example.

I don't know if there is an opportunity related to ESG.


What you are saying is true, but it's a little bit superficial. There are many successful long/short funds and even a few short only funds. The risks are largely manageable risks.

The upside to starting a hedge fund is astronomical. And if they can in fact do what they claim, the best business model seems to be to start a hedge fund and manage the risks, doesn't it?


A few reasons but the primary one is impact. Our mission is corporate accountability through information transparency. Once you become a hedge fund your incentives for information dissemination become warped.

How would your incentives not be aligned? On the surface it seems the following would be true: assuming you are shorting companies who are "crooked", you'd do a write up as to what you found and publish it, just like people do with shorts now.

I could easily be missing something related to your specific goals or situation, but on the surface I don't see it? Have you seriously explored it? I mean, if this thing you have works, you might be leaving a billion dollars on the table...


Hi, great idea! I tried to do something similar for earnings call transcripts and got very stuck very fast, so I appreciate that this is quite a tough problem. My suggestion would be to see if your algos can extract useful structured data from earnings calls too, as C-level executives often give information there that’s not necessarily included in the regulatory filings. Good luck!

Thanks for the wishes! It took us a year of research to get here. Thanks for the suggestion about earnings calls!

Quant with some NLP experience here. Impressive business traction so quickly, good stuff. Who do you see as competitors in this space? I'm wondering whether you think that footnoted* (https://footnoted.com) offers something similar, or might your products be complementary in some way? Thanks.


We love footnoted! Michelle Leder of footnoted has been really supportive of our work. She recognizes that it's an important area for tech innovation. footnoted isn't operating right now but when it comes back, we'll support her and collaborate if we can.

We're often compared to AlphaSense, Sentieo and InsiderScore but our product is pretty different. Competitor products focus on sentiment and linguistic metrics or search etc, not on extracting and organizing important textual content.


Former equity analyst intern and now a software engineer here. My Dad also works in the investment world (and is a CPA and a developer, coincidentally), and after Sino-Forest happened, he wanted someone to parse annual reports and AIFs to create a "weasel word index." Ever thought of doing that?

Basically rank companies in estimated honesty by the language they choose to use.


Aw cool! Can I hang out with your Dad? ;) We do pick up on overly promotional/jargon-y language to some extent. For the most part, however, word lists haven't worked for us.


Interesting site. I currently work at a hedge fund, but have a small dose of NLP in my academic background, so it's always interesting to see concepts like this come out.

Two questions:

- Are you using EDGAR's 'Facts' function? It seems to make SEC filings a lot more like structured text than they have been previously, but I haven't seen really convincing tools developed to use it yet

- How/do you ever see yourself interfacing with similar 'red flag' screening tools that just work on the numerical side i.e. accounting ratios ?

Also, you've got a grammatical error on your Values and Vision page. Normally I wouldn't comment to point that kind of thing out, but for an NLP startup it seems more appropriate ('its volume' not 'it's volume')!


Thanks for the website edit! Fixed.

We don't rely on XBRL for parsing. It's not very consistent/reliable and it's mostly for numeric content. We've definitely considered integrating ratios into our dashboard. It isn't a current priority because ratios are already well supported elsewhere.


Asking to learn - ex-Algo here. May I ask where else ratios are represented? Thanks and all the best for your launch. Very useful service.

From experience, I would suggest you have a way for the manager of a fund or a desk or a bank to see the usefulness. You will have good pull from the line staff, but selling to the managers is the hard part.


Thanks for your wishes! For ratios, we love Alpha Vantage. They are great value for money.


Does your model flag NKLA? Nikola Motors, ostensibly a fuel cell vehicle company, famously reported $36,000 in "solar revenue" (they got a contract to put solar panels on someone's house) while simultaneously having a market cap of over $4 billion.

If so, can you comment on what your model considers red flags in Nikola's SEC filings?

https://hindenburgresearch.com/nikola/

https://sec.report/Document/0001731289-20-000012/


Yes, it did. BUT here's the thing -> Nikola was first a SPAC and by the time it filed its first filing as Nikola, the Hindenburg report was already out, so it's not a true example of our algorithms preceding short reports. Our algos did beat Hindenburg to the punch on a number of other companies though.

A bit outdated but check this out - https://bedrock.substack.com/p/bedrock-ai-vs-activist-shorts


Nikola red flags for you! (all algorithmic)

1. "For example, in September 2020, our founder and former executive chairman, Trevor R. Milton, stepped down from his positions with us."

2. "During the fourth quarter of 2020, the Company ceased operations related to the Powersports business unit in order to focus on the Company's primary mission of commercial production of semi-trucks and construction of hydrogen fueling stations."

3. "As of December 31, 2020, we have $46.3 million of prepaid in-kind advisory services remaining which is expected to be consumed in 2021 and will be recorded as research and development expense until we reach commercial production."


in-kind services is always a red flag because it's easy to fudge the accounting


just a sample


Congrats! Did you by chance work with the YC team that launched MarketBrief? I don't think they are around anymore but they did SEC Filings too, albeit 10 years ago:

https://techcrunch.com/2011/08/15/yc-funded-marketbrief-make...

Disclosure / Shameless Plug: I work on https://last10k.com ...a consumer offering for reading 10K/Q reports more efficiently


Thanks for the wishes! I have never heard of MarketBrief, this is very interesting.

We most probably couldn't have done what we are able to do today back then because the ML scene was quite nascent.


Interesting.

How do you think about backtesting? There are a few short-only shops that specialize in finding frauds. If you get their historical 13-Fs, how would you score against them in terms of precision/recall?

And I guess more broadly, how does alpha with your system compare to a portfolio that holds all short positions by big long/short funds (ex thematic shorts)? Meaning, those guys have full-time humans that focus on this... can you beat them? Very interesting if so.


Re: backtesting, we use SEC enforcement actions related to fraud (10b-5) as our gold label. That said, there is no gold label for absence of fraud, so our true performance is probably slightly better than the backtested metrics suggest.

We've never tried to score a short fund on their precision/recall but unofficially, Hindenburg Research has the highest concordance with our models in deployment.
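
A minimal sketch of why positive-only labels understate precision, assuming hypothetical tickers and a simple set-based scorer: any flagged company whose fraud never led to an enforcement action is counted as a false positive.

```python
def precision_recall(flagged, enforced):
    """Precision and recall of flagged companies against labeled fraud cases."""
    flagged, enforced = set(flagged), set(enforced)
    true_pos = len(flagged & enforced)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(enforced) if enforced else 0.0
    return precision, recall


# Hypothetical tickers, for illustration only.
flagged = {"AAA", "BBB", "CCC"}   # companies the model flags
enforced = {"AAA", "DDD"}         # known enforcement actions (gold label)
undetected = {"BBB"}              # real fraud that never led to enforcement

measured = precision_recall(flagged, enforced)
actual = precision_recall(flagged, enforced | undetected)

print(f"measured precision={measured[0]:.2f}, actual precision={actual[0]:.2f}")
```

Measured precision can only be a lower bound here, because adding true but unlabeled fraud cases never removes a true positive. Recall has no such guarantee: undetected fraud that the model also misses would push it down.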


Re: alpha -> our focus is on extracting red flags that are similar to what a forensic accountant/analyst would find. AI-assisted research rather than AI-driven. Trading on fraud signals alone is pretty hard; you need another event. We haven't done quant testing of our risk scores in a while. We definitely should do proper quant backtesting though.


It seems to me all you have to do to fool an ML model is rephrase sentences and just use verbiage it has never seen before. So why isn't your tool just the latest item to scrub against on the checklist of a pre-IPO company that is looking to commit fraud?

Also if Muddy Waters can do something similar to this (if not better?) doesn't that mean you should just hire them and throw away your ML model?


It isn't that easy to reverse engineer our models. We test our models for a certain degree of robustness (the same concepts expressed in different ways will generally be picked up). Plus, if companies go out of their way to express risks in cryptic terms, our models pick up needlessly convoluted text as well. We also constantly retrain our models, so I don't foresee this becoming a problem in the foreseeable future.

Why can't there be an adversarial GAN model that tries to minimize your model's score by using text that your model has not seen yet, ad-infinitum?

How would you prevent would-be filers from using your own product to iterate on their wording until the red flags are removed?


We don't sell to filers. Ever. Got to protect against conflicts of interest. That's one of the reasons we don't have a true free version of the product.

That said, we're using language models, so replacing words with synonyms won't evade the model as long as it's expressing the same thing.


Do you have any deeper protections than simply not selling to filers? It doesn’t seem so hard for a motivated filer to circumvent by using friendly hedge funds to lend their licenses.


We constantly retrain our red flag models and they are tested for robustness (whichever way the company decides to express the existence of a risk, like say an investigation, we ensure that the models pick it up)


Realistically they probably don't need to do that. The filings are usually pretty heavily lawyered up and the lawyers are pretty lazy on language updates.


Yeah, people have been doing sentiment analysis of press releases etc for years now and very few corporates bother trying to use software to test/counteract it


Companies are already doing this. IR consultants have, allegedly, been advising exec teams to use/not use certain words based on their impact on AI models. It is inevitable.

SEC filings are somewhat more robust because they are a standardized format, but the point of many of these frauds, for example Sino-Forest, was about things which weren't in the document, not the things that were (the stuff that was in documents was never a smoking gun...iirc, I remember looking at the stock before it happened...a lot of these Chinese frauds had totally fine numbers though, that was the point).


As someone who has looked into alternative data business models for the finance industry, it is really awesome to see someone doing this as a company. I'd be interested to understand how you think about your revenue model. I feel that if your data provides alpha (i.e. selling before other people are aware of the problematic disclosures), then as your models become validated within the industry, someone/some firm is going to use it to generate alpha. But then you have a problem where that one firm captures most of the value and takes it from other participants, who now lose the value-add of your product.

How do you balance those two sides? I mean it as a potential customer who would love to pay for your product, but wants to understand how you prevent this becoming an alpha-generating NLP strategy for the one hedge fund that pays the most for it.


It's definitely something we've discussed as a team. We would like to help make fraud less profitable.

The current iteration of the product requires users with some level of financial expertise - hence why we are starting with fundamentals investors. We believe that the longer a fraud goes on, the more people get hurt - so we want to bring these issues to the forefront. Perhaps each trade can be considered a zero-sum game, but long-term there is a benefit to all market participants. We love the idea of every investor considering how aggressive accounting/reporting informs management integrity. Unfortunately, I think we have some time before this becomes the norm.


> Most public company data is unstructured and textual.

This is surprising to me.

Wouldn't it be possible to make a very short list of, say, 12 blunt questions that would help flag fraud?

Of course the company could be lying in the answers. (If it was found to be lying in any one answer then it would automatically be flagged as "high risk".)

But this may not even be needed, since what you seem to be saying is that the information about fraud is there in plain sight, yet hard to see, because it is drowned in fluff and periphrases.

Is there an opportunity for a private company, say a rating company, to make and distribute such a questionnaire? Or does it already exist in some form?


An interesting idea. I'm not sure if, say, 12 or so questions could cover the risks, or if it would just add one more data point for people to (not) read?

There are certainly some sorts of transactions that are much more risky than others - so having an easy source for these transactions would be useful (and we'd build it into our models). But often the signals are more subtle. Companies with overly aggressive accounting policies across the board tend to have completely different undisclosed problems. Since there are estimates and judgement involved in all areas of accounting, looking at the aggregate impact of all policies can be important.

On adding more information - The SEC adopted rules to modernize disclosures of risk factors in 2020, requiring a summary risk factor disclosure if the risk factor section exceeds 15 pages (https://www.sec.gov/news/press-release/2020-192).


> or it if would just add one more data point for people to (not) read?

Intuitively fraud should be correlated to specific accounting practices, or maybe simply to the number of hours billed by the auditors, who are likely to charge more if the work is more "complicated"... Disclosing the amount paid to auditors could be one of those questions.

An analogy would be the "Joel Test" (2000) [1]; checklists in general, and direct questions, are much more revealing than blurb written by the target or a communications agency.

My two cents.

[1] https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-s...


Are you at risk of being sued by companies that you give a "fraudulent" score to (correctly or incorrectly), maybe in a class action? Is this one of the reasons why there is no public version of your tool?

Very intriguing.

Can you give a sense of your model(s) metrics? Sensitivity and specificity in validation/test? If you're open to it, explainability assessments?

What is your market size? Can your models transfer to other compliance spaces?


Our product has two very distinct outputs: 1) red flags and 2) a risk score. Our primary value-add is the red flags, which are a qualitative input into an analyst's process. We find the information you need to see without having to spend 3 hours reading nonsense. The two components get tested very differently. The red flags explain (sort of) the risk scores but not in a truly explainable fashion. The risk scores are optimized for precision, and approximately 1 in 3 companies with a "fraud" score will end up with an SEC investigation or equivalent.

Yes! The NLP research we've done can be adapted to other tasks.


Great work! I work in financial services and built a similar tool for sentiment analysis and topic modelling on transcribed earnings calls. The idea was to identify topics at a speaker-level (analyst, management, etc.) and evaluate the sentiment around that topic. For example, perhaps "foreign exchange" was discussed negatively by a given company in an earnings call, which would alert the analyst to review that call in greater detail.

Are you guys thinking about incorporating something like this into your product?


Thanks! Earnings transcripts sentiment analysis is not currently in our product roadmap. One of the reasons is that such tools are already available on the market. We wanted to explore more underutilized sources of information.


Hey there, I have some experience summarizing 10-Ks and 10-Qs for a personal project: I clustered BERT embeddings, then selected the clusters that had a high correlation with non-standard price deviation in the next 20 days. You can check out the results on https://eclect.us/

I'd love to learn more about the methods you're using and I would also like to know if you're ever looking to hire more developers.


I had a look at your website and the related blog post. It looks pretty cool! We are working on some abstractive summarization projects but haven't deployed them yet. We are not hiring right now, but please do send me a note at suhas @ the bedrock ai domain.


Really interesting and useful if it works.

Have you considered validating financial figures against the Benford’s Law distribution? It is not NLP, but I am curious whether it still works.
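
For readers unfamiliar with the check, a minimal sketch, assuming the figures are available as plain numbers: Benford's Law says leading digit d should appear with probability log10(1 + 1/d), and a chi-square statistic measures how far the observed digits deviate from that. The function names are hypothetical.

```python
import math
from collections import Counter


def leading_digit(value):
    """First significant digit of a non-zero number."""
    # Scientific notation puts the first significant digit first, e.g. '4.56e-03'.
    return int(f"{abs(value):.15e}"[0])


def benford_expected(d):
    """Benford's Law probability of leading digit d (1 through 9)."""
    return math.log10(1 + 1 / d)


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * benford_expected(d)) ** 2 / (n * benford_expected(d))
        for d in range(1, 10)
    )
```

With nine digit classes there are 8 degrees of freedom, so a statistic above roughly 15.5 rejects conformity at the 5% level. A deviation is only a prompt for a closer look, since many legitimate data sets do not follow Benford's Law.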


Congrats on the launch guys! Fellow Canadian from Montreal here :) I'm curious, what tools / methodology did you follow to generate the high-quality labels, and how many different labels did you end up generating? I'm also very curious whether you view the discovery and generation of new labels (and accompanying high-quality training datasets) as a continuing and core part of your development going forward?


Ooh I love MTL. That's the first place I lived in Canada. Great q. We used SEC enforcement actions related to fraud as our gold label (fairly common practice in academia). The really key thing here is that you need to be careful about what years you're using for training, because if you include years that are too late in the fraud cycle, you end up with significant target leakage, e.g. the filings will say, "we're being investigated for fraud". We ended up manually reviewing all of our data/labels. It took over a month. We also use settled class action lawsuits as silver labels, plus a few other more frequent labels as bronze labels.
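
A minimal sketch of the leakage guard described above, assuming each filing is a (ticker, filing date) pair and that a hypothetical lookup gives the date each fraud or investigation became public. It illustrates the idea and is not Bedrock AI's actual data preparation.

```python
from datetime import date, timedelta


def leak_free_filings(filings, disclosure_dates, buffer_days=0):
    """Keep only filings made before the company's fraud became public.

    filings: iterable of (ticker, filing_date) tuples.
    disclosure_dates: dict mapping ticker to the date the fraud or
        investigation became public. Tickers without an entry keep
        all of their filings.
    buffer_days: extra safety margin subtracted from the cutoff.
    """
    kept = []
    for ticker, filed in filings:
        cutoff = disclosure_dates.get(ticker)
        if cutoff is None or filed < cutoff - timedelta(days=buffer_days):
            kept.append((ticker, filed))
    return kept


# Hypothetical tickers and dates, for illustration only.
filings = [
    ("AAA", date(2019, 3, 1)),
    ("AAA", date(2021, 3, 1)),   # filed after disclosure: would leak the label
    ("BBB", date(2021, 3, 1)),
]
disclosed = {"AAA": date(2020, 6, 1)}
training_set = leak_free_filings(filings, disclosed)
```

The buffer exists because the text of a filing can hint at an investigation before it is formally announced. How large it should be is a judgment call.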


Thanks for the quick answer! Follow-up around this as it's a space I'm actively working in: did you use or build any tools for the labeling process, or was it Excel? :D Also, do you ultimately see/position your solution as an AI-powered exploration tool that allows humans to derive better insights, faster (but where the NLP side of things is simply to assist in this discovery process), or do you see the models (and resulting flags) eventually being able to completely replace the human intuition?


Our annotation process has been manual so far, but we are working on building something to make it more efficient :-) We see our solution as an AI-powered assistant for qualitative research that makes the job of an analyst much easier, and don't see it as 'replacing' humans for the foreseeable future.


I've read many SEC filings (mostly 10-Ks, 8-Ks and 4-Qs), and I led a project to develop an UIMA-based analysis platform for them.

What is shocking is the lack of diligence that companies can get away with. The "Risk Factors" section (Item 1A of Form 10-K) is often token info copy-and-pasted from other filings, notably from competitors, instead of having an entity's actual risks in it.

Good luck with the business!


Thanks! Yes, we keep track of risk factors that are just copy-pasted across companies. There is so much of it!
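One simple way to spot copy-pasted risk factors, shown here as a sketch and not as Bedrock AI's method, is to compare word shingles (overlapping runs of k words) between two passages with Jaccard similarity. The function names, the shingle size, and the example sentences are all made up for illustration.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-word shingles in a passage."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    company_a = "Our business depends on our ability to attract and retain key personnel"
    company_b = "Our business depends on our ability to attract and retain qualified personnel"
    company_c = "Demand for lumber fluctuates with housing starts in North America"

    print("A vs B:", round(jaccard(shingles(company_a), shingles(company_b)), 2))
    print("A vs C:", round(jaccard(shingles(company_a), shingles(company_c)), 2))
```

A high score between two different companies suggests boilerplate; comparing every pair of passages does not scale to a full filings corpus, where an approximate technique such as MinHash would be the usual next step.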

Great stuff.

Do you have any data/indication of the outstanding short positions vs. when you detect something, i.e. are these generally stocks shorted quite a bit at that time? Could they even be borrowed/how expensive were they to borrow?

These kinds of analyses are definitely interesting (and are happening elsewhere too, I think). So really cool area!


Given your example boilerplate disclosure, is one of your key conditional statements simply categorizing boilerplate? Like instead of simply putting a sentiment score on the wording, you first isolate it as boilerplate or as a unique potential aberration and then assign weights?


You've hit the nail on the head (mostly)... but I'm going to say no more because it's part of our secret sauce.


Yes, I figured! I found your synopsis to be inspirational

An additional tool I always wanted was to match external - even macroeconomic - events to company risk factors


There have been several popular books about famous fraudsters. I suspect that you come across some facts that would be interesting (if not profitable) not just to institutional investors but to the average nerd, or maybe someone looking for an idea for the next bestseller.


Like this? This Oxford City Football team, a boiler room operation, duped people by saying that they were bound by a voice recognition and proved it by using the phone to make some beeping sounds. https://www.sec.gov/litigation/litreleases/2017/lr23869.htm


"Lying For Money" by Dan Davis has several chapters about financial fraud. It's a great read.


Or this? John Rohner claimed that he'd developed, tested and patented a "plasma engine" fueled by inexpensive and abundant noble gases. He also claimed that he graduated from Harvard at 14 and had 3 PhDs from M.I.T. Somehow he managed to dupe 98+ investors.


Now I can see a Yelp-style service where a company pays to have its draft checked for red flags and edited appropriately before actually filing it with the SEC :)


I love symmetries. Is the reverse of this also true? Could there be disclosures missed by investors that if unearthed would positively impact the company?

Yes! That is something we are looking forward to working on in the future.

> in the company’s disclosures including buying and selling from companies controlled by their directors

How did you find that companies are controlled by their directors?


It's disclosed in their filings e.g. "Among the vendors were a director of the Company and an entity controlled by such director"

A lot of egregious things are disclosed on page 101 of a filing, but they get missed because these filings are so long and deathly boring.


Just curious, how many 10-Ks did your team read thoroughly to make good labels?

Update: I see you mentioned 10b-5 as your gold labels


Do you have any plans to take S-1 filings and make “cliff note” versions, i.e. distill down to core financials and important data?


Yes, this is at the core of what we do - taking a document with 2000 sentences like an annual report and distilling it down to the 20-30 sentences that actually matter. We do have plans to process S-1 filings, and we are currently working on building abstractive summarization capabilities.


>> Information drives financial

That's not true. Government money printing drives financial markets.


Are you guys able to access the Ledge site? Seems down for me but the main site is fine

It works fine for me now. Please do let us know if you are still unable to access it.


