I checked the code and found the issue. It's a result of Gemini's larger context window.
Basically, the Foxtrot scraping library sends the page in chunks. The chunk size is capped at each model's max context length, which for the Gemini lite model is 1,000,000 input tokens, compared to 128,000 for GPT-4o-mini.
Typically, you won't need all the tokens in the page, and sending a million tokens when 100,000 will work is wasteful in terms of cost and runtime, and can also hurt accuracy.
I'm going to re-run the benchmarks with a cap on the prompt size for models like Gemini.
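For illustration, here's a rough sketch of the kind of cap I mean (the tokenizer choice, the 100k value, and the function name are assumptions for the example, not the actual Foxtrot code):

    import tiktoken  # OpenAI's tokenizer, used here as a rough proxy for any model

    # Assumed cap for illustration: well below Gemini's 1M-token context window.
    MAX_PROMPT_TOKENS = 100_000

    def chunk_page(text: str, model_max_tokens: int) -> list[str]:
        """Split page text into chunks no bigger than min(model max, our cap)."""
        limit = min(model_max_tokens, MAX_PROMPT_TOKENS)
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        return [enc.decode(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]

    # With the cap, Gemini gets ~100k-token chunks instead of one 1M-token chunk.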
I want to get my genetic data, but, like, obviously I don't want to go through one of these services where they ingest all that data and keep it around forever. Honestly, I'd like to be the only person with access to it, and to be able to destroy it at will.
Tough requirements, I know.
Anyways, do you know of any services that meet those reqs? Any good DIY ideas?
I didn't follow competitors too closely, but Color may be what you're looking for. I don't know if they sell direct to consumer though.
If you use 23andMe and request data deletion, they will make a best effort to delete the data. It's part of their GDPR requirements. When I was there, I worked on this project, and they put a lot of effort into it. A big chunk of the engineering org focused on GDPR compliance for a month or two. They definitely don't intentionally keep data around if you request deletion.
The one caveat is that data deletion is hard, and it's possible that some gets accidentally retained. I left the company over 5 years ago, so I don't know how good their deletion process is now.
One final note: to keep costs low, 23andMe doesn't look at your whole genome. They only look at a handful of "SNPs" in your genome that are known to be significant. If you've heard that humans share 99.5% of their DNA with each other, this is what that figure is based on: 23andMe looks at the <0.5% of DNA that commonly varies between humans.
The reason I mention this is that, if you're very interested in your DNA sequencing, you may want to opt for a higher cost service that does full genome analysis. I don't know any names but I believe there are some DNA services that do this.
I requested deletion of all my data from 23andMe, but they said they keep “Genetic Information”. Does that mean 23andMe still has my “SNPs”?
(I’m based in Europe)
Message I received by email:
> 23andMe and the contracted genotyping laboratory will retain your Genetic Information, date of birth, and sex as required for compliance with legal obligations, pursuant to the federal Clinical Laboratory Improvement Amendments of 1988 and California laboratory regulations.
> 23andMe will retain limited information related to your deletion request, such as your email address and Account Deletion Request Identifier, as necessary to fulfill your request, for the establishment, exercise or defense of legal claims, and as otherwise permitted or required by applicable law.
Not directly, afaik they never transferred the data.
However, they sold access to the data to a big pharma company (GSK). This was widely publicized. Not sure if that counts: GSK had some ability to look at the data but didn’t have an on-premise copy of it.
Also, I worked on the GDPR deletion project. I can attest that they make a best effort to delete your data when you request that. At least when I was there, this was the case. One caveat is for coding errors, oversights, and bugs.
I had direct knowledge of this. To clarify, I believe 23andMe did not give direct access to individuals' DNA data.
What 23andMe was selling to GSK was the results of GWAS (Genome-Wide Association Studies), which could be used to generate therapeutic candidates.
GWAS is a sort of rudimentary machine learning algorithm that basically maps a phenotype (like propensity for a particular disease) to a region of DNA. From there, the drug company can narrow down candidate genes to target when developing specific drugs.
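To make that concrete, here's a toy single-SNP association test in the spirit of GWAS (the genotype counts are invented, and a real GWAS runs this kind of test across millions of SNPs with covariates and population-structure corrections):

    from scipy.stats import chi2_contingency

    # Invented genotype counts (AA / Aa / aa) at one SNP, split by phenotype.
    cases    = [120, 300, 180]   # people with the disease
    controls = [200, 290, 110]   # people without it

    chi2, p_value, dof, _ = chi2_contingency([cases, controls])
    print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
    # A tiny p-value (the genome-wide threshold is usually 5e-8) would flag the
    # region around this SNP as associated with the phenotype.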
No, generally not. 23andMe only ever did ISOs prior to the SPAC merger, not RSUs, and they had quite a high strike price. There was a 6-month insider lock-up against selling, so by the time most of us could sell, the price had tanked. By the time I was able to sell, the share price was so low that only my oldest options, from when I started at the company in 2014, were in the money.
I ended up making a few tens of thousands, not the hundreds of thousands it would have taken to compensate for the low pay all those years. And I probably did better than most by selling what I could when I could. Most people weren't in the money at all.
Maybe if you timed the market perfectly you could have done well. I don't know that anyone did.
Was there an internal sense that the business model was flawed? You can only sell so many DNA tests to people, and the pharma research angle always felt like more of a pipe dream than a viable business.
The company bet heavily on pharma/genomics, and it was a bad bet.
When I was there, people were pretty confident in this bet. They had just signed a huge deal with GSK, so it seemed to be going well. There wasn't widespread dissent at the time (~2016-2017). I imagine it's different now that the stock price has crashed more than 10x.
The company did follow Ancestry.com pretty closely. Ancestry did not bet heavily on genomics. Instead, they bet heavily on a subscription model and focused more on consumer interest in their ancestors. This has worked out a lot better for them than 23andMe.
FWIW, I agree it's obvious in retrospect that pharma was a bad bet. Leadership should have made better decisions.
As usual: “It depends”. Data on gene variants related to the first steps in drug metabolism can be quite useful both at home and clinically, e.g., your own responses to ethanol, caffeine, and many over-the-counter and prescribed drugs.
St Jude Children’s Research Hospital routinely genotypes/sequences children before drug treatments to optimize initial doses. It makes a huge difference in outcomes for most cancer patients.
But chronic age-related diseases that older individuals care about most are too complicated and too strongly affected by environmental factors to be well predicted by low-coverage sequencing or genotyping platforms. Even deep sequencing and perfect telomere-to-telomere personal genome assemblies (still about a $10,000 to $20,000 effort) will not be sufficient. You really need the patient’s full history and deep omics data. Michael P. Snyder and colleagues at Stanford are getting close to this type of “future preventive health care” with a focus on type 2 diabetes.
Polygenic risk scores based on simple GWAS results and additive genetic models are uninformative (or minimally useful) with respect to clinical care for complex diseases, even those with moderate heritability. There are simply way too many variables, too many undefined gene-by-environment effects, and too many non-additive effects (epistasis). Polygenic risk scores typically account for less than 20% of variance in disease traits.
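For reference, the "additive model" behind a polygenic risk score is essentially a weighted sum of risk-allele counts; a minimal sketch with made-up numbers:

    import numpy as np

    # Per-SNP effect sizes (e.g., log odds ratios from a GWAS); values invented here.
    effect_sizes = np.array([0.12, -0.05, 0.30, 0.08])

    # One person's risk-allele counts (0, 1, or 2) at the same SNPs.
    dosages = np.array([1, 2, 0, 1])

    # Purely additive: no gene-by-gene (epistasis) or gene-by-environment terms,
    # which is exactly the limitation described above.
    prs = float(np.dot(dosages, effect_sizes))
    print(f"polygenic risk score = {prs:.2f}")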
Coming around full circle, though: these platforms ARE useful for pharmacogenetic predictions of the initial metabolic processing of drugs, getting us closer to the right dose the first time.
And the SNP genotypes generated by 23andMe are also valuable predictors for a subset of variants that contribute to nearly monogenic disorders.
Just for some perspective from outside the US: I work at a bank in a country subject to GDPR. I have access to customer data, as do most people on my team.
I worked at a US startup from my home country, and that company dealt exclusively with PII (i.e., IDs, faces, etc.) of people, including members of the armed forces, from NA and some European countries.
I had access to any data I wanted to see and could download it onto my work laptop (we all worked remotely). I didn't have to ask anyone, I didn't have to justify it, and AFAIK it was not audited. Logged? I don't know, maybe it was. I sent an email about this once to a director and an SVP and never received even an ack. Oh, by the way, everybody had access, not just me; no special access level was required, a company email was sufficient. And IIRC even the staging environment had production data, and staging was truly fair game.
No, I did not misuse it; I only used it a handful of times for debugging purposes. I doubt anybody did misuse it.
The company was not very well run, so I’m not surprised. Their stock price has dropped more than 10x since the IPO, and it fell by half during the employee lock-up period after the IPO.
I haven't put together a good test framework yet, but qualitatively, the results are surprisingly good, and hallucinations are fairly low. The prompt tells GPT to say "(not available)" if needed, which helps.
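The instruction is along these lines (a paraphrased sketch, not the extension's exact prompt or code; it assumes the official OpenAI Python client):

    from openai import OpenAI  # official OpenAI Python client

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Paraphrase of the anti-hallucination instruction, not the exact prompt text.
    system = (
        "Extract the requested fields from the page text. "
        "If a field is not present, answer exactly '(not available)'. "
        "Do not guess or invent values."
    )
    page_text = "Acme Widget - $19.99. In stock."
    request = "Fields: product name, price, shipping weight"

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"{request}\n\n{page_text}"},
        ],
    )
    print(resp.choices[0].message.content)  # shipping weight should come back as "(not available)"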
I'm going to try the "generate selectors" approach as well. If you'd like to learn more or discuss, just reach out via email (marcell.ortutay@gmail.com) or Discord (https://discord.gg/mM54bwdu59 @ortutay)
It’s actually a Chrome extension, and it runs in the user’s browser. You can configure the scraping rate and the app warns against higher rates (more than 5 tabs at a time).
I don’t think any sites are going to get hammered though, even at the fastest rates. The limiting factor is often LLM token rates.
One thing to note about FetchFox: it runs as a Chrome extension. This means it has a different interaction with anti-scraping measures than cloud based tools.
For one thing, many (most? all?) large sites ban Amazon IPs from accessing their websites. This is not a problem for FetchFox.
Also, with FetchFox, you can scrape a logged in session without exposing any sensitive information. Your login tokens/passwords are never exposed to any 3rd party proxy like they would be with cloud scraping. And if you use your own OpenAI API key, the extension developer (me) never sees any of the activity in your scraping. OpenAI does see it, however.
> And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?
FetchFox :).
But besides that, the gold standard for scraping is proxied mobile IP requests. There are services that let you make requests which appear to come from a mobile IP address. These are very hard for big sites to block, because mobile providers aggregate many customer requests together.
The downside is mainly cost. Also, the providers in this space can be semi-sketchy, depending on how they get the proxy bandwidth. Some employ spyware, or embed proxies into mobile games without user knowledge/consent. Beware what you're getting into.
Good question, I actually haven't tried it with the image capture approach. I'll give that a shot and see how it performs. I'm planning to try many different AI extractors, and see which performs best.
So far, I've done some unscientific testing to compare text vs. HTML. Text is a lot more effective on a per-token basis, and therefore lower cost. However, some data is only available in the HTML.
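As a rough illustration of the per-token difference (not the extension's code; tiktoken and BeautifulSoup are just convenient stand-ins for counting and stripping):

    import tiktoken
    from bs4 import BeautifulSoup

    html = ("<div class='product card'><a href='/item/42'><span>Acme Widget</span></a>"
            " <b>$19.99</b></div>")
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

    enc = tiktoken.get_encoding("cl100k_base")
    print("html tokens:", len(enc.encode(html)))  # markup inflates the count
    print("text tokens:", len(enc.encode(text)))  # same facts, far fewer tokens
    # ...but the href ("/item/42") only survives in the HTML version.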
I had another request for the exact same thing, actually. I'm planning to separate out the scraping library from the Chrome extension, and this project would be a good use case for that library.
Unfortunately the bigger issue with Gemini is cost, which is too high for the scraping use case.