
This feature is implemented at a low level, and works on the command line.

For example, if you have a directory that is stored entirely in the cloud, you can `cd` to it without any network delay, you can do `ls -lh` and instantly see a listing with real sizes (e.g., see that an ISO is 650 MB), and you can do `du -sh` and see that all the files are taking up zero space on disk.

If you open a file in that directory, it will open, even from the command line. Run `du -sh` again and you'll see that that file is now taking up space, while all the others in the directory still are not.
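
Roughly what that looks like in a shell (the paths and file names here are made-up examples for illustration, not anything from the actual product):

    $ cd ~/Dropbox/isos            # no network delay
    $ ls -lh ubuntu.iso            # real size shows up immediately
    -rw-r--r--  1 me  staff   650M Apr 20 10:12 ubuntu.iso
    $ du -sh .                     # ...but nothing is on disk yet
      0    .
    $ open ubuntu.iso              # opening the file downloads it
    $ du -sh ubuntu.iso
    650M    ubuntu.iso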

You can right-click to pin files and directories to be stored locally, and right-click to send them back to the cloud so they don't take up space.

This is actually very different than traditional network file systems like SMB, NFS, WebDAV, and SSHFS. With a normal network file system over the WAN you would have major latency problems trying to `cd` and `ls` the remote file system. Most of them also don't have any ability to cache files locally when offline, or the ability to manually select which files are stored locally and which are remote. AFS does have some similar capabilities.




This is how Bitcasa works, I think. You would see all your files in a virtual hard drive of unlimited space, and some magic (machine learning and smart heuristics) would try to figure out my data access patterns ahead of time.

So I would see a list of all my pictures; if I decided to open the first one, it would take a few seconds to download, and as I started browsing through the pictures it would figure out that I planned to look at all of them and pre-fetch them from the cloud, so on average there would be no perceived latency.


Was working (didn't they just pull the plug on all their "unlimited" accounts?)


They pulled the plug on all their "unlimited" accounts a long time ago.

They recently shut down their free plan.


Datto Drive just launched today and is giving away a TB of space for free: www.dattodrive.com


So, if I search inside my Dropbox or Documents folder, will it download all the files from Dropbox to my computer?


(First off, as a disclaimer: I no longer work for Dropbox and don't speak on their behalf. I've only used the feature as a user.)

I don't know of a common search/find system that open()s or read()s files during the search by default. AFAIK Spotlight and Windows search are indexed searches. As for the indexing operations, I don't know how that is handled; they could disable indexing for remote files, or they could somehow integrate with the indexer.

Based on my testing of a pre-release version of the feature (it isn't released yet), if you were to do something like `find ~/Dropbox -type f -exec md5 {} +`, it would download files.

As a user it did exactly what I expected. It was totally seamless; I was truly amazed.

Compared to the complexity of what has already been implemented, solving the problem of "I want to recursively open/read every file in my Dropbox, but I don't want it to download terabytes of data and fill my hard drive" seems fairly simple. For example, there could be a setting for the maximum amount of space Dropbox will use, e.g., 40 GB, and Dropbox could be smart enough to watch disk usage. If you `grep -R`, it may download/open/read the files, but once you reach 40 GB or get near your disk capacity, Dropbox could start removing local copies of files that are not pinned to be local, i.e., remove the files that were downloaded because of the open()/read(), not the files you explicitly told it to keep local. I don't know how the team will choose to implement these features, but I'm confident it will be well thought out and tested.
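
To be clear, that is just me speculating. Here is a rough sketch of the kind of eviction I mean, with the 40 GB quota hard-coded and `is_pinned`/`send_to_cloud` as made-up helpers standing in for whatever Dropbox actually does:

    # Hypothetical cache eviction, not Dropbox's implementation: once local
    # copies exceed the quota, dematerialize the least recently accessed
    # files that the user didn't explicitly pin.
    QUOTA_KB=$((40 * 1024 * 1024))
    find ~/Dropbox -type f -exec ls -1tur {} + | while read -r f; do
        [ "$(du -sk ~/Dropbox | cut -f1)" -le "$QUOTA_KB" ] && break
        is_pinned "$f" || send_to_cloud "$f"    # made-up helpers
    done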

Remember, Dropbox is the company that went so far as to monkey patch the Finder to get the sync icons (http://mjtsai.com/blog/2011/03/22/disabling-dropboxs-haxie/). They will go to great lengths for a seamless user experience, and they do a ton of testing. I have no doubt that when Project Infinite is widely available it will be amazing, seamless, and have functionality many people thought wasn't possible or only dreamed of.


> I don't know a common search/find system that open()s or read()s files during the search by default.

... grep?


I don't think antoncohen meant searching file contents, but point taken.


And antivirus, and I'm sure there's more.


Valid question. I wouldn't be happy if I hit Ctrl-F in "My Documents", ran a search, and a 1 TB download started up invisibly in the background, filling my hard drive.


Perhaps the integration will extend to triggering a search on the back end.


I suppose any company that is giving all their encrypted data to Dropbox to begin with may be OK with it. But most companies are already sketched out by the mere fact that their data is accessible to anyone outside the company.

In any event, if they were to index and provide search as a service as well, I wouldn't think it's something they would do quietly. It would most likely come with its own huge marketing campaign.


Could Dropbox detect repeated access patterns from the same process, and/or whitelist processes as known "searchers," and start returning blank files? This seems like the kind of problem only a unicorn would dare to tackle, but as luck would have it...?


So search is broken in these directories?

That seems like a lousy tradeoff.


You want to save space by not having data on your local system but use a local search to look in the contents of files not on that system? You can't have your cake and eat it too.


Sure you can: index it once, and then stream it forevermore.


except remote contents can change


Store a hash or checksum in the index, and have the remote API return the current hash/checksum so you can tell whether the file has changed.
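
Something like this toy loop, where `list_remote_files`, `indexed_checksum`, `remote_checksum`, and `reindex` are all made-up stand-ins for whatever the real index and API would provide:

    # Toy sketch of checksum-based invalidation, not a real Dropbox API.
    list_remote_files | while read -r f; do
        [ "$(indexed_checksum "$f")" = "$(remote_checksum "$f")" ] && continue
        reindex "$f"   # only re-fetch and re-index files whose contents changed
    done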


I believe this is not the case here. In order for the files to start taking up space on your drive, you would actually need to right-click that folder and choose "Save a local copy".


This is what I was wondering - we'd have to be careful writing a script that happened to traverse into the Dropbox folder, because it might try to inflate all the files. It still seems like a cool idea, but I wonder if they have a workaround.


Well, the feature isn't released yet. I'm sure aspects of it will change based on test user feedback.


I wonder how this will work with Spotlight.


Spotlight is enabled by default _and_ left enabled on basically all Macs, and Mac users are a big part of Dropbox's user base. It is very unlikely the Dropbox team will forget that Spotlight indexing is running in the background.

That does not mean the files will get indexed, but there is no chance that Spotlight will trigger an unexpected terabyte download in the background.


Why would it do that? Can't it just be added to the search index and then "archived"?


grep doesn't have a search index.


But wouldn't you expect a grep to download the files so they could be searched, as opposed to a locate or find, which I wouldn't expect to?


I thought I was inquiring about Dropbox, not grep.


"This is actually very different than traditional network file systems like SMB, NFS, WebDAV, and SSHFS. With a normal network file system over the WAN you would have major latency problems trying to `cd` and `ls` the remote file system."

Is that still true for sshfs?

People used to ask us if they should rsync to us directly or sshfs mount and then rsync to the mount, and we told them not to do that since the original rsync stat would basically download all files simply to look at them / size them.
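
For anyone unfamiliar, the two setups being compared look roughly like this (the paths, mount point, and user@host are just placeholders):

    # Option 1: rsync straight over ssh (what we recommend)
    rsync -a /local/data/ user@rsync.net:backups/

    # Option 2: mount the account with sshfs, then rsync to the mount
    sshfs user@rsync.net: /mnt/rsync.net
    rsync -a /local/data/ /mnt/rsync.net/backups/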

But I don't think that's the case anymore. I think sshfs (or perhaps something about FUSE underneath) is smart about that now ... isn't it ?


I haven't used FUSE SSHFS in around 8 years. I'm sure it has improved a lot since then. I could imagine it handling file listing and stat better than other network protocols (cd/ls over ssh works well over most WAN connections). It looks like it now caches directory contents too (https://github.com/libfuse/sshfs). It probably wasn't fair for me to include SSHFS in the list since I haven't used it in so long. It was troublesome when I used it.


> People used to ask us if they should rsync to us directly or sshfs mount and then rsync to the mount

Are you guys allowing full access to the machine now through LXC containers or some sort of VM?


"Are you guys allowing full access to the machine now through LXC containers or some sort of VM?"

No - it is the customer, on the client side, that creates an sshfs mount representing their rsync.net account.

It works very well and it is very nice to have a plain old mount point that represents your rsync.net account - especially since you can just browse right into your historical ZFS snapshots, etc.

But in the past, people did that and they got the bright idea to rsync to that local mount point, to do their backups, and that didn't work well.

But my understanding is that nowadays it would work better - you wouldn't download every single file that rsync simply stat'd or listed ...

We still don't recommend it, though. No reason to add that complexity.


This is similar to the idea behind git-annex.


git-annex also has the advantage of letting you keep data in multiple destinations (including in cold storage), which I think is becoming increasingly important.


Is it not the same way SkyDrive on Windows 8+ works?


They dropped that feature (and the ball) in Windows 10 https://www.thurrott.com/cloud/66733/project-infinite-bring-...


I would be worried if Microsoft couldn't get the feature working correctly and had to drop it...


lol


Yeah, I think it's largely the same functionality as OneDrive/SkyDrive has; it's really useful for businesses that have uploaded many TB to it.


This is sort of like how OneDrive used to work on Windows 8, but they dropped the feature later.


> and you can do `du -sh` and see that all the files are taking up zero space.

That seems wrong to me. It would violate the assumptions of software that stat()s directory entries and verifies not only presence but also non-zero size.

So it's risking buggy behavior to gain a latency edge over other networked filesystems. I think smart prefetching while preserving correctness would be better.


du uses `st_blocks`, not `st_size`, so it should be fine for most applications. It is similar to how sparse files behave:

    $ ls
    $ truncate -s 1M foo
    $ ls -lhp
    total 0
    -rw-r--r-- 1 catwell wheel 1.0M Apr 26 18:39 foo
    $ du
    0    .
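
You can also see the two fields directly with stat (BSD stat syntax shown; the GNU equivalent is `stat -c '%s %b' foo`):

    $ stat -f "st_size=%z st_blocks=%b" foo
    st_size=1048576 st_blocks=0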


All is good then.


`du -A` for the "apparent" size (GNU du spells it `--apparent-size`).


I came to ask if this is really that different than Selective Sync. Glad to hear it is.


So, is it working off some kind of always-in-sync (assuming a live network connection) manifest?

I confess, I did not watch the video, and only briefly skimmed the announcement.

Edit: I'm asking a genuine question of real technical interest here. How can this be implemented with no latency and real file sizes immediately available for inspection, while taking up no disk space? I went back and read the announcement again, and there are no hard details I can see that I missed in my initial skim. There has to be something stored locally, right? Hell, I'm running gigabit fiber here, and I still notice latency in the CLI for anything that requires a network connection. Perhaps I misunderstood the parent?



