Hacker News new | past | comments | ask | show | jobs | submit login
Remove first 300M lines from 700 GB file on 1TB disk (stackexchange.com)
213 points by asicsp 34 days ago | hide | past | favorite | 92 comments

My first idea was to solve this "queue" (FIFO) problem with a "two stacks" solution: Without special system calls, files can only be appended and truncated at the end, not the beginning, so they behave like a stack. Two stacks can be used to simulate a queue.

Implementation sketch:

1) Successively read the last 1 kiB from the source file, append it to a temporary file, and truncate 1 kiB from the source file.

2) Truncate the last 300M lines from the temporary file.

3) Successively read the last 1 kiB from the temporary file, append it to the destination file, and truncate 1 kiB from the temporary file.

This way, you never need more than 1 kiB of additional disk space (tune this number as desired). I think this is an elegant solution. The correct solution, of course, is to buy a second disk.

You can do it without a temporary file if your filesystem supports sparse files.

Figure out the output size, and create a sparse file with that size. You'll be able to use it like a regular file, but blocks will only take up space once you write to them.

Now simply copy from the input to the output file, starting at the end. Truncate the input file whenever the disk fills up.

Oh god, it's Towers of Hanoi with disk blocks.

> 1 kiB of additional disk space (tune this number as desired)

Definitely tune that. The file-system is probably working in 4KiB blocks so there would be no benefit in working on smaller ones. In fact as you'll write to each block four times you are probably wasting a lot of time and causing more wear than you need to.

I would suggest something far bigger (in the MiB range at least), especially on traditional drives where reducing head movements is key but random access latency on SSDs is not as close to zero as many people think particularly for small writes. Make sure that whatever you move the data with is loading that full block of your desired size into RAM before starting to write to the destination (i.e. perhaps pipe through something like "buffer -s 4M" if your read/write tool of choice does not support this directly).

In fact an SSD might internally be working with MiB sized blocks, another reason to work in that scale at least and be careful of alignment where you can detect it or the lack of it.

But yes, that solution would definitely work if carefully optimised.

> The correct solution, of course, is to buy a second disk.

Definitely. Must faster, and you still have the original data in-place so you can run checksums to verify the new copy before committing to use it.

While filesystem block is usually 4k, I think ssd blocks are in the 256k-4mb range. So 4k is probably a bad choice.

Aye, I mentioned that a bit further down.

Also if working with network attached storage you have an extra set of considerations that will change the optimal block size to work with.

You must be pretty good at whiteboard interviews.

I can pick up new technologies quickly and I am a good developer (at least my manager says so) but I can never come up with elegant solutions like this to what I call a whiteboard problem. This is the exact reason why I fail at those whiteboard interviews.

Any suggestions for me?

Maybe it's a whiteboard problem, but I would just call it a puzzle. Puzzle solving requires creativity.

Picking up new technologies doesn't require creativity. Being a good developer (these days) also doesn't require much creativity, especially if you're just implementing spec and not designing systems.

The best way to get better at puzzles is to practice puzzles! The reason I don't want to call these problems whiteboard problems is because it implies that they're only used in that context. These types of puzzles can be done for fun, too! Perhaps you start with simpler puzzles and try some simple solutions. You might solve a few. You'll try some harder puzzles and you'll get frustrated because the solution doesn't come immediately. Sometimes you'll combine two solutions you've used previously to solve something more complex. It's an amazing feeling!

If you want a book of math puzzles, I recommend "Measurement" by Paul Lockhart. He explains and helps you explore math in the way he thinks it should be taught in schools, and I agree with him.

If you want something more programming focused, take a look at "Programming Pearls" by Jon Bentley. It's a very old book, so the constraints he has to work with are quaint by today's standards (do something without using more than a megabyte of RAM. Hah!), but they're good puzzles as far as I remember.

Good luck!

Edit to add: The most useful thing you'll get from spending time with puzzles from a certain field is the intuition about which tools to reach for and which to forget about. You'll still likely have to try for a bunch of tools before you find the ones that work the best (or at all), but you'll get better at choosing ones that are more likely to work sooner, so you'll be able to progress faster on harder puzzles. As Greg LeMond said about cycling: ‘[It] never gets any easier, you just go faster’ :)

Thank for those recommendations!

> You must be pretty good at whiteboard interviews.

I never had a single whiteboard interview in my life, so I can't give you advice on how to get better at them, sorry.

Back in university, I had a "Design and Analysis of Algorithms" course, and one of the first (toy) exercises was to implement a queue using other data structures. This solution with two stacks was what I came up with. That's 12 years ago now, good times :-)

The only suggestion I have is practice, it's a skill like many others where practice makes you aware of patterns and these patterns have common or known approaches. Of course just blindly applying patterns won't cut it, the practice part is that the common ones become ingrained and you can focus your mental effort into the specifics of a new problem.

Rehearsing your knowledge of data structures and algorithms while studying algorithmic quizzes can get you pretty far, it's time-consuming though so it needs to be something you get fun out of it, I don't get much fun out of it so I have done the bare minimum that is needed for this industry...

I like this but there's a trick: reversing in blocks makes the last 300M "lines" not exactly the same as the first 300M lines of the source file.

But the basic idea still works: You could read the original file forward to find the index of the 300M'th line, and then just halt the stack reversing bit when you get there. Delete the whole original file and then stack it back.

Don't you need your read to be line-oriented to avoid breaking up lines?

You might want to read from the source file line oriented until you find the start position of the first line to copy. Then copy from the end as described in large chunks (I would do 16Mb chunks assuming ram is sufficient).

The sparse file suggestion a sibling made is really great. Otherwise, you could reverse the chunks as you write them, and end up with a fully reversed file, or just have to be careful about block boundaries and know that contents are forward, but blocks are backward. I wouldn't want to bulk read a 700g file with line processing, if it's possible to ignore that.

You win the internet today, that's a fantastic solution.

I think the only gotcha is to handle the last block read in step 1 in case it's not evenly 1 kiB. Otherwise the first read in step 3 will overlap the blocks.

You should dub this “slinky copy”

I must be misunderstanding what you mean by "Truncate the last 300M lines from the temporary file" because it sounds like it would remove the only copy of the 1 kiB chunk on the first iteration.

Oh, I think I see what you're doing here. Step 1 is actually an iteration over the entire file (I interpreted it as merely step 1 of a per-1k-block loop), effectively making a backwards (blockwise) copy of the original file but only using up 1kB of extra space at any given time. Step 2 is to truncate the "last" 300M lines, because those correspond to the first 300M lines of the original file. Step 3 is to re-reverse the temp file into the destination.

This is great, although the line-vs-block-orientation issues will be tricky to make sure you get right, as others have pointed out. The truncate will almost certainly need to leave a partial last block.

The true answer is buried about half way down the page. The fallocate command / syscall nowadays allows you to remove or insert block-aligned extents at the beginning of a file, assuming you're using a modern filesystem like ext4 or xfs. The only difficulty that remains is working out how many blocks to remove and dealing with the non-aligned part. LWN has a few articles on this topic: https://lwn.net/Articles/589260/ https://lwn.net/Articles/629965/

I don't know man, the top answer is clever, but it's also a pretty straight-forward UNIX way of doing it, and pretty easy to understand. Going around messing with trying to convert to filesystem blocks and obscure syscalls that only work on certain file systems seems way scarier in comparison. And you're almost certainly going to end up with extra characters at the beginning, so you having to do some version of the dd thing anyway.

The actual answer, as others have pointed out, is to buy a new drive. If the data has any value, it's worth it.

I worked at a place where the entire's company business (images) was stored on a single large drive, and the CEO was too cheap to buy a second, so there was no backup. Never underestimate how dumb people can be sometimes.

Was that a special type of drive that cost as much as the company's value? Was there any justification besides the cost of the drive? Even something like 'I live dangerously.'

Inaction, deluding oneself about risks and hubris all have a special value all their own. Very popular these days.

Hey colleague! Of course whatever was used to make those images disappeared some 10 years ago. And the actual mint images are already gone from an earlier disk-crash. THESE images are actually made from systems that have been running at a customer site for years and years.

> The actual answer, as others have pointed out, is to buy a new drive. If the data has any value, it's worth it.

If you read the comments on the StackExchange question, the scenario that prompted it does make sense:

>> I'm migrating our company's database and this is 700GB file with profile data in JSON format, each line one profile. The process that imported the data stopped after running several days after 70% done. Now, to not repeat the full import, I want to cut the first 70% (300 million lines).

>> [One option would be] exporting the data from the old DB again, skipping the first 300 Mio entries and cp it again over to the new and import it there. It roughly takes a day. I just thought there must be a way to do it in place: the data are already there - just truncate the first part of it and continue.

So it's not as if this is the only copy of the data. It would just be nice if there was a way to cut some of it without recopying everything.

In that case, surely it would be possible to point the database importer at a pipe, the other end of which is the output of `tail -n +300000000 db_dump.json`. No need to actually modify the file.

Or, if it's JSON, just replace the first 300M lines with spaces.

You probably don't care if it's 299M rows, so just dd over the start of the file, guessing the number of bytes to write and choosing a low number to be sure. You'll know if you're missing records when you run a count(*) on the database when it's loaded.

Use some hex editor (most support files larger than ram) to check you have the necessary opening square brace at the start of the file, and to remove whatever partial record is there.

Sure, it isn't elegant, but is a solution most people can come up with without needing to spend 20 minutes hunting through man pages.

Yeah, fair enough, that's a reasonable scenario.

I've found it curious that so few (any?) file systems allow for non-aligned starting blocks. Ie keeping an offset into the initial block. That would make prepending files rather trivial, as well as trimming the start of files (hello log files).

Of course performance would suffer due to non-aligned access, but just as for regular files one could demand aligned access for optimal speed.

Extent-based filesystems have no problem with that, and performance wouldn't be impacted either. The preserved tail of the first extent would have to copied / CoWed, that's it.


From the linked article:

>The immediate user for this operation would appear to be video editing applications, which could use it to quickly and efficiently remove a segment of a video file.

Do any of Linux video editor apps actually employ this feature?

Aren't all video editors "non-linear editing" these days anyway? All they mutate when you're editing is a small amount of metadata called edit decision list. The output file is only created when you explicitly execute a "render" operation, everything you see before that is just parts of the input files mixed as needed.

A segment of a video file isn't going to be block-aligned.

Deleting a non-block-aligned segment of a file means copying the remaining data over it and then truncating the file at the end using the ancient truncate function. It's not obvious how that can be improved.

Without getting into the details of video compression, you can't remove arbitrary parts of a video file without causing problems when the file is played.

Even if the video is uncompressed you will still need to remove entire frames and then - for most formats - reconstruct a file header at the start of the file.

This is a much harder problem than throwing away x% of a file without worrying about the structure of the content.

It's fairly easy to solve that problem at the disk block level. But if you're trying to keep the data structures error-free, you're going to have to do a lot more work.

the only answer i’d use here is “go buy another 1TB drive and then ...”

$100 and a day and a half for shipping is a much better idea than playing russian roulette with work data by adjusting files in-place.

> playing russian roulette with work data by adjusting files in-place

That can be avoided if your filesystem supports reflinks / copy_file_range (BTRFS, XFS, OCFS2). Just create a reflink of the file, drop all the blocks at the beginning of the reflink "copy" up to the one containing the line you want to start with using fallocate --collapse-range, and overwrite the remaining data up to the first line with some suitable fill character. For the JSON-encoded lines in this case spaces should work. Total disk space used: one data block for the padding plus some metadata for the reflink. No changes to the original file. No copying or relocating 700 GB of data. Quick, easy, and and safe. But you have to start with the right filesystem for it to work.

Very recently I've been in a similar situation with a dedicated server. There was no way to add storage and even getting another server would have been pretty expensive. Luckily it was a raid, so I just had to double the storage temporarily :)

I guess you weren't allowed to use a public cloud provider. Because you can always mount a network file system using sshfs, s3 like, or similar. Or upload the file on another server rented for one hour.

I guess I could have asked for the CC information, setup a new AWS account, and have them pay for the storage and egress of 1.5TB of data. However, that very likely would have lasted quite a while as well, and in my case I had to minimize downtime as well. So, no, I would say even if you're allowed to do something, sometimes it is not the best solution.

I'd use the answer 2 (but attempt to just compress the output first, depending on type of data, leaving original file alone)

That is a nice approach, I agree. But from next time it is better to use the name of the answerer or link rather than saying 1st or 2nd answer since the position may change anytime depending on upvotes.


I can't imagine unless s/he wants to accidentally lose the data any better alternative.

Even if this works perfectly, you never want to delete data you can't easily retrieve again.

Next Monday the client might change their mind

Yeah that's right. Even 2 TB drives don't cost $100 though.

Samsung 850 EVO 250GB 2.5-Inch SATA III Internal SSD (MZ-75E250B/AM) https://www.amazon.com/dp/B00OAJ412U/ref=cm_sw_r_cp_api_fab_...

Ok, $110.

Parent is agreeing with you. 2TB drives do not cost $100.


ah damn platters are cheap!

i forgot about not-SSD for a second there.

It’s not like you can’t test your script on another file first. If it works fine on a 1GB file with 300K lines, why wouldn’t it then work on the real file?

What would you do when the next time this happens it’s 300B lines from a 1PB file?

Backup is important even if you're confident.

This comes up from time to time and depending on luck you'll get people saying it's stupid and just buy another hard drive, or someone goes nerdy and figures out it can be done with conv=notrunc. The top answer uses that as well, just like I did when answering a previous instance of this question on stackexchange (that question was about compressing a file without having to store the compressed version in parallel -- same question different application https://unix.stackexchange.com/a/383019/30731).

The dd approach is interesting. Editing a file like that in place cowboy style with dd would fill me with anxiety however

Can anyone explain how dd is able to read and write a file at the same time without using a temporary file? :\

You can do it with a single filehandle, reading the data then seek()ing backwards before you do the writes. A Perl example that reads a file with the numbers 1 to 100 then replaces the digits with X's:

  open(my $fh,">foo") or die($!);
  for (1..100) {print $fh "$_\n"};
  open($fh,'+<','foo') or die($!);
  while(<$fh>) {
    my $spot=tell($fh)-length($_);
    seek($fh,$spot,SEEK_SET) or die($!);
    s/\d/X/g;#replace digits with X's
    print $fh "$_" or die($!);
If his file has lines of equal size, this would be fairly easy to do in place. Would be a little more complicated if not.

Number of file handles is irrelevant, all handles for the one file are accessing the same data.

Just pointing out it's not needed. Why is that irrelevant?

There's nothing inherently wrong with reading and writing to the same file (as least if it is a binary file). And if you read at offset x and write at offset x-y you can easily move contents around with only a very small buffer. Just be sure that you move in the correct direction, so that you dont override stuff you need later.

That people normally use temporary files, has two reasons: The first is that it is much safer if something goes wrong. You then still have the original and not some half-way processed file. The other reason is that reading and writing in text mode does not really work. But you can of course process text files as binary files and deal with line endings yourself.

it opens up two file handles, reads from the first and writes to the second.

That doesn't fully explain it but I think I got it, each line is overwritten with what it's place on the 300th line after.

Something like

  readLineOffset = 301
  writeLineOffset = 1
  While not read eof
    line = Read(readLineOffset)
    Write(WriteLineOffset, line)
    readLineOffset ++
    writeLineOffset ++

dd reads and writes in blocks of bytes, not lines. You have to figure out where 300th line starts beforehand.

More over, how dd deal with EOL ? and counting them ?

I believe it doesn't, and the solution in the OP first counts EOLs to get the byte offset you want, then tells dd what to do just in terms of byte offsets.

Assuming the data is at all important, "go buy another drive" is really the only valid solution here...

A well balanced team will have two types of engineers. One says "Here's the DD incantation that if uttered precisely right will produce the desired result."

And another engineer to say "Are you kidding me? Just buy another freakin' disk!"

Yes, and that team should have 1 guy saying DD it, and the rest saying 'buy a disk' even if most of them already thought to DD it.

It's like trying to fix an engine that's running. My god man, why would anyone try to do that, other than to put it on YouTube.

How many lines are in the 700GB file? What percentage of the file does the first 300M lines take up?

At work, we have lots of 20-30GB text files that gzip down to 14-15% of the original size. If this 700GB file was similarly populated, a gzip of the whole thing would be 100GB or so. Doing a tail piped to the gzip command would work.

Whats wrong with this:

1) scan to find 300m line offset N 2) shift file from N to 0 using reads+writes 3) truncate last N bytes

You can print out the current offsets in case of a crash you need to resume from. You read exactly the size of the file once and write it exactly once, and it's safe.

It's #1 answer. Nothing wrong except it isn't safe. Depending on type of crash you might not have a chance to see your printed offset.

If you print the offset after a write has safely completed, then it should still work. The internal copy is idempotent, up to the number of bytes in the first 300M lines. In other words, assuming reasonable line lengths, you can safely re-do a copy operation. So if you crash, you may not have printed out the last offset, but you have recorded a recent offset. You can then just restart from that recent offset.

The old fashioned way to deal with making sure you see your printed offsets is to actually print your offsets. You'd go to the machine room and run the command on the system console, which would typically be a hardcopy terminal like a DECwriter.

The trendy modern way, of course, would be to use the cloud. Upload the offsets to the cloud as they are produced, pausing until you get confirmation after each upload.

A more clever and/or foolhardy way is to put the crash recovery markers in the file itself. Place evenly spaced markers throughout the file, saving the data that they overwrite in a separate file. Then do your in-place file shifting. Finally, replace the markers with the saved data.

You can miss a printed offset and it still works. The file data exists in two places while writing the offset.

That is basically the answer in the OP, do you suspect there's something wrong with it?

Beaten to the punch :)

Disregarding the copy time, why not just find the 300M-th newline and move everything else down, then truncate?

What does "move everything else down" mean?

As someone not too familiar with precise system calls/file system internals, I just meant that if the "keep" part of the file starts at byte 1234, then start moving all of them at an above that location to locations at and above 0 (i.e to the beginning). And when the last byte is now at byte `last - 1234` then truncate the file.

One could write a tiny program to do that and I'd feel relatively safe running that.

(All that `dd` stuff would probably work too, and faster, but you'd need to spend time learning it first then. But that tool probably has discrepancies between OS'es, has history, etc. Can be simpler to just write the program.)

EDIT: and you can iterate in blocks of the size of the disregarded prefix.

I would guess memory map the whole file, then do the equivalent of:

memmove(beginning_of_file, beginning_of_file + 300_m_line_offset, file_size - 300_m_line_offset);

Why do you need to move everything down? Can't you just tell the filesystem that the file moved and changed size?

You just cat/pipe it to an Amazon s3 bucket (check out s3 command line tools), trimming the lines you don’t want, and piping through bzip2 as well if needed. Storage cost for an hour or so to hold in s3 it is quicker and cheaper than trying to muck around installing a second hard drive.

You’ll need a good connection, though!

I find myself imagining that one of the lines turns out to be more than half the file size, and picking a solution that fails to account for that possibility.

While it is fun to discuss the literal question given, this is very likely one of those "XY problems" [1]. You don't need a file with the first 300M lines cut off. You need a view of the file with the first 300M lines cut off. To the consumer of such a file, there is no difference between that and a file with the first 300M lines cut off. One way to produce that is by producing a literal file on disk with that, but it's not the best or even IMHO a good way.

The easiest way is to use "tail -n +300000001" as someone mentioned, but then, piping the result to where it needs to go. In a bad scenario, the target program requires a file, in which case you can set up a FIFO: https://www.howtoforge.com/linux-mkfifo-command/

In the worst case, the consuming program insists on random access to the file, in which case, finally, you have the problem that you need a real (OS-level) file. But as this is a database reload situation (expand the comments in the question to see this), this is unlikely to be the case.

(In a truly desperate situation, you could grab your favorite scripting language that has FUSE support and set up a special-purpose one-file FUSE filesystem that serves that one file with the front cut off and translates all seeks correctly. It's probably easier than you think, but certainly I'd have to be backed into a corner pretty hard to do this.)

The upshot is, files aren't files, or to put it another way, the file!interface isn't the same as files!disk. File!interface is an interface that allows you to read, write, seek, etc. and files!disk are a particular implementation of a bag of bytes on disk, and despite using the same name, it's important to keep them straight. The user doesn't need a file!disk that does what need, they need a file!interface that will serve the contents they need. It is much easier to get the file!interface than the file!disk in this situation.

Or to put it another another way, it's all just numbers. Files have no ontological existence, really. It's all just a question of what numbers will be returned by a "read". If you return the right numbers, you have the "right" file, regardless of how those right numbers appeared. I realize this may sound simple and obvious when I say it, but it is a major mental block that almost all junior programmers and a fair number of mid-level programmers have. It especially causes problems when you're a network programmer doing network things; you really need to understand that in the end, it's all just numbers, and as long as you consume the correct numbers, and produce the correct numbers, the right things will happen.

(The simplest case of this was when I produced a network server on port 80 that, if you poked it with the correct magic sequence, did the thing it was supposed to do, but if you did anything else, returned a hard-coded byte string that "happened" to be an HTTP response that said what you did wrong and how to fix it, completely ignoring the rest of the "request", including ignoring if it was even HTTP at all. You don't need an HTTP library to produce an HTTP response, you just need to produce the right numbers on the TCP stream. Nginx does something similar to detect when you're trying to use HTTP on an HTTPS port, so it returns a nice message instead of a bare binary SSL negotiation.)

[1]: http://xyproblem.info/ don't love the name, but it seems to be somewhat established terminology.

The user wants to avoid having to rewrite the file to disk

"This is one option. Exporting the data from the old DB again, skipping the first 300 Mio entries and cp it again over to the new and import it there. It roughly takes a day. I just thought there must be a way to do it in place: the data are already there - just truncate the first part of it and continue. But I'm not in a desperate position. I have backups and no (extreme) time pressure. So in this sense it's an exercise."

Solutions such as: compressing, splitting into smaller files, using dd would require re-writing most or all of the file so are not what the user is looking for

Using fallocate to change where the file starts from would appear to be the correct solution but it would leave some unwanted data at the start of the file.

The best solution would be to use fallocate and then overwrite the unwanted lines in the first block with newline characters.

In gdb, conditionally break on fopen() call, then use fseek to skip first 300M lines, then continue import.

Someone else who actually knows how to use debuggers; a dying breed!

Don't forget about intercepting each ftell() though...

That’s both awful and awesome.

Caveat: yes, the real answer is buying more disk because it's safest, but I have a couple of outrageous suggestions that I haven't seen floated:

1) custom FUSE filesystem that just skips the first N bytes of a file with a given name. Calculate N ahead of time by reading the file (you could calculate it on-demand but this will introduce a massive perf delay in some early file op that could cause a timeout for whatever software is reading the file)

2) (warning: this sounds insane but I've actually seen embedded devices that do this type of gymnastics in the name of not buying more storage space) if it's acceptable to temporarily have the file not be a text file, you could write custom code for sliding the data forward in the file by N bytes, starting with the N + (2*blocksize)th byte. The first two blocks are left until the end because you encode the progress of this operation into a state block, which includes a SHA of the state for validation purposes and which is written atomically, alternating between block 0 and 1 of the file. Your actual move of data must be designed in such a way that each data write is nondestructive if its state write is lost (i.e. it can only write to scratch space). Resume an interrupted move by figuring out which state validates and has the higher serial number. This is not something I'd attempt as a one-off. I'd want to be damn sure this code was well-tested.

Just for fun of course, but how about changing the low level data in the inode structure and having it point to a data block that is 300M lines past the start? Can this be done with debugfs, I wonder?

If the data is irreplaceable, the correct answer is: Get another disk.

Data isn't irreplacable, it takes ~24 hours to regenerate though, which is quicker than ordering a disk online.

Then do both, your successor will thank you! Or yourself, if you fuck up twice.

On mainframe, a one step JCL with a smart DFSORT and proper disk allocation would do this without blinking...

Oh, but it’s legacy. Better reinvent chunks of the wheel every 5 years.

vim <filename> 300000000dd :wq

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact