
The two other main competitors, aimed at startups (and really venture investors), are Pitchbook and CB Insights.

There are several equivalent products like CapitalIQ (owned by S&P Global), Preqin, and offerings from Factset/Refinitiv that are aimed more at private equity investors (later stage) but also include some startup data.

Finally, there are specialized startup data providers like Harmonic.ai (in-depth scraping of stealth startups), G2 (Yelp for enterprise software), or Clay.run (innovative UI) that all specialize in something specific but are not at the scale of the above.

How do they get the data?

The first place is SEC Form D filings. These are required in the US after private funding rounds (lots of caveats, ifs, buts, etc., but let's keep it simple). This data alone can give you a decent database to start with. After that, it is web-scraping news articles, news wires, LinkedIn, etc. For very specialized areas (e.g., Dev Tools), specialized data sources (say, Github Archives) might be useful.
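
As a very rough sketch of the Form D piece: EDGAR publishes daily form index files you can filter for "D" filings. The URL pattern and file layout below are from memory, so treat them as assumptions and verify against the EDGAR docs.

    import requests

    # Hypothetical example: pull one day's EDGAR daily form index and keep Form D rows.
    # URL pattern and file layout are assumptions -- verify against EDGAR documentation.
    URL = "https://www.sec.gov/Archives/edgar/daily-index/2024/QTR1/form.20240102.idx"
    HEADERS = {"User-Agent": "your-name your-email@example.com"}  # SEC asks for a contact UA

    resp = requests.get(URL, headers=HEADERS, timeout=30)
    resp.raise_for_status()

    form_d_rows = []
    for line in resp.text.splitlines():
        parts = line.split()
        # The form type is the first whitespace-delimited field; "D" is a new Form D, "D/A" an amendment.
        if parts and parts[0] == "D":
            form_d_rows.append(line)

    print(len(form_d_rows), "Form D filings on this day")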

Most importantly, many of these providers aim for give-to-get dynamics. Once they become popular enough, startups will actually seek out having a profile (create data) or fix incorrect data (contribute). This is a great dynamic, of course, because it essentially creates proprietary but free data collection.

Websites like TheOrg.com have done a nice job with org charts -- they take a guess at who you report to... and a lot of employees, annoyed at being "layered", will freely fix the data. If you get enough volume, you create a give-to-get flywheel.

I agree with you that what is valuable here is the proprietary data. But, behind that, is the _process_ for creating the proprietary data. You could get very good at web-scraping, parsing esoteric government filings, etc. And maybe that space can get disrupted by someone better (say, with LLMs). But ultimately, if you can get users to contribute data -- that's the "promised land" in DaaS.

I also think UI/interface is not value-less. Companies like Clay.run have done a great job making proprietary data accessible to more users. There is value there -- but the data owner collects a (fair) toll on that.


Thanks for this overview!!!

SEC filings + crowdsourced content seems like the way to go. Plus, who wouldn't want to celebrate their latest funding round :)

Curious, how much would you pay for a service where you get the same data as Crunchbase, but with a delightful UI, focused on Pre-Seed to A, in a vertical like "Dev Tools"?


I think there is a subset of VCs that would pay for this... unfortunately, that very particular subset of VCs has the smallest budget to pay for things, based on their fixed fees/fund sizes.

Firms like CapitalIQ or Pitchbook have their largest contracts with giant asset managers, for whom a 6- or 7-figure deal would be a very small percentage of AUM (and thereby a small percentage of management fees).

For angels/seed-stage VCs, you are likely looking at "prosumer"-like prices. So, something like $100-1,000/month at most.


There is a new approach of using static site generators to make BI pages feel instantly responsive.

- Evidence.dev (https://evidence.dev/)
- Observable Framework (https://observablehq.com/framework/)

I have found that both the speed and the frontend control either of these tools gives you are pretty good (with Evidence looking better out-of-the-box, just in my personal opinion).

The main problem I always had with the embedded versions of Tableau, Looker, etc. is that they felt super canned (it was obvious it was a poorly/"lightly" white-labeled solution -- when you see an embedded Tableau dashboard, you _know_ it is Tableau, etc.) and that they were slow.

More on this here: https://magis.substack.com/p/an-observation-on-dashboard-spe...

PS -- I would add that the "headless" version of the above tools that I have seen is https://cube.dev/


What do you think of a solution where it's headless (i.e. bring your own charts/components) but with a no-code builder that gives you the benefits of an off-the-shelf tool to manage and update easily? (That's what we're building at embeddable.com.)


It sounds like an interesting product but not a fit for us:

- We actually don't want no-code. We want configuration-based (or something else that can fit into version control).
- We want very nice, very customizable charts, but don't want to bring our own (our team is Python/SQL and data based; no one writes JavaScript).


I loved this game and then this style of game (all the Tycoons, SimCity, AoE, etc.).

I did not know it, but it was my first experience with economics. I mistakenly thought that this is what policy planners _actually_ did in the real world. Imagine the disappointment when I found out that was not true.

Nonetheless, it ultimately inspired me to go into data science in market/competitive intelligence. First at hedge funds and now at my own startup.

I have never been able to shake the notion that building a real-time view of the real economy is the most interesting thing to work on.


I do not have a horse in the race, but it is interesting to see open source comparisons to traditional timeseries strategies: https://github.com/Nixtla/nixtla/tree/main/experiments/amazo...

In general, the M-Competitions (https://forecasters.org/resources/time-series-data/), the Olympics of time series forecasting, have proven frustrating for ML methods... linear models do shockingly well, and the ML models that have won generally seem to be variants of older tree-based methods (e.g., LightGBM is a favorite).
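
(For anyone curious what those tree-based methods look like in practice, the usual recipe is just lag features plus a gradient-boosted tree. A minimal sketch on synthetic data, not taken from the competitions:)

    import numpy as np
    import lightgbm as lgb

    # Minimal sketch: forecast a synthetic monthly series with lag features + LightGBM.
    rng = np.random.default_rng(0)
    t = np.arange(240)
    y = 10 + 0.05 * t + 2 * np.sin(t * 2 * np.pi / 12) + rng.normal(0, 0.5, 240)

    LAGS = [1, 2, 3, 12]  # recent months plus the same month a year ago
    X = np.column_stack([np.roll(y, lag) for lag in LAGS])[max(LAGS):]
    target = y[max(LAGS):]

    split = len(target) - 12  # hold out the final year
    model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(X[:split], target[:split])

    preds = model.predict(X[split:])
    print("MAE:", np.mean(np.abs(preds - target[split:])))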

Will be interesting to see whether the Transformer architecture ends up making real progress here.


They are comparing a non-ensembled transformer model with an ensemble of simple linear models. It's not surprising that an ensemble of linear time series models does well, since ensembling optimizes for the bias-variance trade-off.

Transformer/ML models by themselves have a tendency to overfit past patterns. They pick up more signal in the patterns, but they also pick up spurious patterns. They're low bias but high variance.

It would be more interesting to compare an ensemble of transformer models with an ensemble of linear models to see which is more accurate.

(that said, it's pretty impressive that an ensemble of simple linear models can beat a large scale transformer model -- this tells me the domain being forecast has a high degree of variance, which transformer models by themselves don't do well on.)
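(To make the "ensemble of simple linear models" idea concrete, the sketch below just bags linear regressions over bootstrap samples and averages them -- toy data, purely illustrative:)

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy sketch: average many linear models fit on bootstrap resamples (bagging).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1.0, 500)

    X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
        preds.append(LinearRegression().fit(X_train[idx], y_train[idx]).predict(X_test))

    ensemble_pred = np.mean(preds, axis=0)  # averaging lowers variance at little cost in bias
    print("MAE:", np.mean(np.abs(ensemble_pred - y_test)))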


FYI, I think you have bias and variance the wrong way around. Overfitting indicates high variance.


Thank you for catching that. Corrected.


> ensemble of transformer models

Isn't that just dropout?


No. Why do you think so?


Geoffrey Hinton describes dropout that way. It's like you're training different nets each time dropout changes.


Dropout is different from ensembles. It is a regularization method.

It might look like an ensemble because you're sampling different subnetworks, but ensembles combine independently trained models rather than subnetworks that share one set of weights.


That said, random forests are an internal ensemble, so I guess that could work.

In my mind an ensemble is like a committee. For it to be effective, each member should be independent (able to pick up different signals) and have a greater than random chance of being correct.


I am aware it is not literally an ensemble model, but Geoffrey Hinton says it achieves the same thing conceptually and practically.
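
(The concrete version of that claim is MC dropout: keep dropout active at inference and average several stochastic forward passes. A rough PyTorch sketch with a hypothetical toy model:)

    import torch
    import torch.nn as nn

    # Toy model with dropout; MC dropout = keep dropout on at inference and average passes.
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))

    x = torch.randn(8, 10)

    model.train()  # leaves dropout active, so each forward pass samples a different subnetwork
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(100)])

    mean_pred = samples.mean(dim=0)   # "ensemble"-style average prediction
    uncertainty = samples.std(dim=0)  # spread across the sampled subnetworks
    print(mean_pred.shape, uncertainty.shape)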


Are these models high-risk because of their lack of interpretability? Specialized models like temporal fusion transformers attempt to solve this, but in practice I'm seeing folks torn apart when defending transformers against model risk committees within organizations that are mature enough to have them.


Interpretability is just one pillar to satisfy in AI governance. You have to build submodels to assist with interpreting the black-box main prediction models.
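
(A common example of such a submodel is a global surrogate: fit an interpretable model to the black box's own predictions and hand its rules and fidelity score to the committee. Sketch below with toy data and a stand-in black box:)

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.tree import DecisionTreeRegressor, export_text

    # Toy data and a stand-in "black box" main model.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 1000)

    black_box = GradientBoostingRegressor().fit(X, y)

    # Global surrogate: train an interpretable tree on the black box's *predictions*,
    # then report how faithfully it mimics them (fidelity) alongside its rules.
    surrogate = DecisionTreeRegressor(max_depth=3).fit(X, black_box.predict(X))
    fidelity = surrogate.score(X, black_box.predict(X))  # R^2 of surrogate vs. black box

    print("fidelity:", round(fidelity, 3))
    print(export_text(surrogate, feature_names=["f0", "f1", "f2", "f3"]))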


Is there a way to directly train transformer models to output embeddings that could help tree-based models downstream? For tabular data, tree-based models seem to be the best, but I feel like foundation models could help them in some way.
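
(Something like what I have in mind, assuming a frozen off-the-shelf sentence encoder -- the model name, features, and labels below are all placeholders:)

    import numpy as np
    import lightgbm as lgb
    from sentence_transformers import SentenceTransformer

    # Placeholder example: embed a free-text column with a frozen transformer and
    # concatenate the embedding with ordinary tabular features for a tree model.
    texts = ["late payment on invoice", "routine renewal", "chargeback dispute", "pricing question"] * 25
    tabular = np.tile(np.array([[120.0, 3], [40.0, 1], [500.0, 7], [35.0, 1]]), (25, 1))
    labels = np.array([1, 0, 1, 0] * 25)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any frozen encoder would do
    embeddings = encoder.encode(texts)                  # shape: (n_rows, embedding_dim)

    X = np.hstack([tabular, embeddings])
    model = lgb.LGBMClassifier(n_estimators=50).fit(X, labels)
    print(model.predict(X[:4]))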


It has been very impressive to see what LLMs can do for transforming data into useful structured datasets in Snowflake.

Snowflake Cortex has been critical in this process for open-source and Mistral models. We have also found using GPT-4 and Claude 3 to be meaningful within Snowflake. Setting up connections between Snowflake and these LLM providers is not too complex but is annoying, so we've done it once so you don't have to.
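
(Roughly what the in-warehouse piece looks like -- a sketch of calling Cortex from Python. Connection parameters, the model name, and the table/column names are all placeholders:)

    import snowflake.connector

    # Placeholder sketch: classify a free-text column in place using Snowflake Cortex.
    conn = snowflake.connector.connect(
        account="your_account", user="your_user", password="...",
        warehouse="your_wh", database="your_db", schema="your_schema",
    )

    query = """
        SELECT
            raw_description,
            SNOWFLAKE.CORTEX.COMPLETE(
                'mistral-large',
                'Return a one-word industry label for this company description: ' || raw_description
            ) AS industry_label
        FROM company_descriptions
        LIMIT 10
    """

    for row in conn.cursor().execute(query):
        print(row)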


The lack of revocability, the marginal temporal value, and the downstream governance issues make the prospect of more such data deals happening slim, I think -- or at least, slim without regret.

I wrote an essay on this here: https://magis.substack.com/p/llm-data-sales-a-market-for-lem...


Cybersyn makes daily trading volumes & prices of all US equities/ETFs executed on the Nasdaq available in your Snowflake instance for free. Data is inclusive of pre-market/after hours activity and is released daily at 6:00am ET. Learn more in Cybersyn Docs:

https://docs.cybersyn.com/getting-started/concepts/stock_pri...


I have been wondering how to support interactive / real-time web apps based on Snowflake data. I suppose pushing down to DuckDB a subset of data needed for a chart would be one way to do this...


If you’re pushing down the data, you’re losing the real-time capability, no?

If you want fast, ad hoc, real-time querying, load the data as it’s created directly into DuckDB or ClickHouse. Now you’ll have sub-100ms responses for most of your queries.


I'd assume they mean users interacting with the chart vs. first load. So the user sees the base chart (let's say 1MB of data on the server, less depending on what gets pushed to the user), and then additional filters, aggregations, etc. are pretty cheap because the server has a local copy to query against.


Yes -- sorry, I meant exactly the above.
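
Roughly this pattern (the table and column names below are made up):

    import duckdb
    import snowflake.connector

    # Made-up example: pull a pre-aggregated slice out of Snowflake once,
    # then serve interactive filters/aggregations from a local DuckDB copy.
    sf = snowflake.connector.connect(account="...", user="...", password="...")
    cur = sf.cursor()
    cur.execute("SELECT region, day, revenue FROM daily_sales WHERE day >= '2024-01-01'")
    base_df = cur.fetch_pandas_all()  # the ~1MB "base chart" subset

    con = duckdb.connect()            # in-memory DuckDB next to the web app
    con.register("daily_sales", base_df)

    # Each user interaction becomes a cheap local query instead of a Snowflake round trip.
    filtered = con.execute(
        "SELECT region, SUM(revenue) FROM daily_sales WHERE day >= '2024-03-01' GROUP BY region"
    ).fetchall()
    print(filtered)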


Access global emissions data sourced from Our World in Data (OWID) in an intuitive form, directly in your Snowflake instance.

Docs here: https://docs.cybersyn.com/public-domain/environmental-and-so...


Our team at Cybersyn aggregated 300M+ domains in a single source. Domains are cleaned into a standardized format with any protocols and subdomains stripped away.

For a subset of domains, the dataset includes information on redirects such as a website’s redirect domain, the start/end dates for which the redirect relationship was observed, and whether or not a domain is the primary landing page. HTTP response statuses indicate whether a domain is active or inactive.

More info in Cybersyn Docs: https://docs.cybersyn.com/public-domain/technology/tech-inno...

