Ask HN: How would you chunk a large Excel file?
22 points by codingclaws on May 26, 2024 | 44 comments
Let's say you had an Excel file with 10,000 rows and you wanted to break it up into many Excel files each with 500 records. Each new file should have the header fields from the original. How would you do it? I did it by writing a node script but I'm wondering if there's an easier way.

Edit: Guys this is just an example. I'm looking for a general solution. It could be 10 million rows.




You can use my SQL spreadsheet app: https://superintendent.app

I had a similar problem at work where I needed to apply some formulas to a 5 GB CSV file. Excel can't handle more than 1M rows. Using a database through the command line is too clunky. I did try splitting the CSV into multiple files, but applying formulas across multiple files isn't easy. Eventually I built a desktop GUI wrapper around SQLite, and it grew into Superintendent.app (now powered by DuckDB).

The newest version supports "Ask AI", which can be used to ask your CSVs anything and to have the AI write/rewrite SQL for you. It has been pretty fun to tell the AI to "Expand *", "format all dates to something like May 3, 2024", and so on.
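
Not the app itself, but the same engine is scriptable. A rough sketch of the split using DuckDB's Python API, assuming the data is already available as a CSV (file and table names are made up):

    import duckdb

    con = duckdb.connect()
    # Load the CSV once; read_csv_auto sniffs the delimiter and detects the header.
    con.execute("CREATE TABLE t AS SELECT * FROM read_csv_auto('big.csv')")
    total = con.execute("SELECT count(*) FROM t").fetchone()[0]

    chunk_size = 500
    for n, offset in enumerate(range(0, total, chunk_size), start=1):
        # COPY ... TO writes each slice out as its own CSV, header included.
        con.execute(
            f"COPY (SELECT * FROM t LIMIT {chunk_size} OFFSET {offset}) "
            f"TO 'chunk_{n:04d}.csv' (HEADER, DELIMITER ',')"
        )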


I am quite enjoying the different approaches people are recommending. Good job.


This looks great. Well done.


Thank you!


Write a script. I know there are decent APIs for Excel files.

I don't know the API or recommended scripting language offhand. This would be a good task for ChatGPT or an equivalent, at least enough to get started.

edit: I asked ChatGPT, and it recommended Python and 'pandas' for interacting with Excel

    import pandas as pd
    # Load the data from an Excel file, assuming headers are in the first row by default
    data = pd.read_excel('path/to/your/file.xlsx')
    # Define the number of records per chunk
    chunk_size = 500
    # Split the data into chunks and write each chunk to a new Excel file
    for i in range(0, len(data), chunk_size):
        # Extract the chunk of data based on the current index range
        chunk = data.iloc[i:i + chunk_size]
        # Write the chunk to a new Excel file, including headers as column names
        chunk.to_excel(f'output_{i // chunk_size + 1}.xlsx', index=False)
I asked about the first 'row' (the header), and it claims pandas includes that in each chunk; to_excel does write the column names by default, so that should hold. It's at least a place to start from; it would need further iteration with real code/tests.
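
For the 10-million-row case from the edit, a single .xlsx sheet can't even hold that many rows, so as a sketch (assuming the data is available as CSV, since read_excel has no chunksize parameter) you could stream it instead of loading everything at once:

    import pandas as pd

    chunk_size = 500
    # read_csv with chunksize returns an iterator of DataFrames,
    # so the whole file never has to fit in memory.
    for i, chunk in enumerate(pd.read_csv('path/to/your/file.csv', chunksize=chunk_size), start=1):
        # Each chunk keeps the original column names, so every output file gets the header.
        chunk.to_excel(f'output_{i}.xlsx', index=False)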


Pandas is absolutely the way. The code you gave looks ok.


Seconding Pandas. I used it just the other day for this very kind of task, it really makes this kind of thing straightforward.


I would use R. Phoneposting now but something like

  library(tidyverse)
  library(readxl)
  library(writexl)
  read_excel("file.xlsx") %>%
    group_by(group_id = (row_number() - 1) %/% 500) %>%
    group_walk(~ write_xlsx(.x, paste0("file_", .y$group_id, ".xlsx")))

edit: updated to write xlsx instead of csv


I would save my data in CSV format, then use this. Save the code below as chunk.pl (remove leading spaces) and call it as "perl chunk.pl" :

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;
    
    # chunk.pl -- split csv files into chunks
    
    # Usage message and exit if needed
    if (!@ARGV || $ARGV[0] eq '-h') {
        print "Usage: $0 input_csv [chunk_size] [output_filename_format] [separator]\n";
        print "Example: $0 input.csv 500 'input-%08d.csv' ','\n";
        exit;
    }
    
    # Set command-line arguments
    my ($INFILE, $CHUNKSIZE, $FMT, $SEP) = @ARGV;
    $CHUNKSIZE //= 500;
    $FMT //= "data-%08d.csv";
    $SEP //= ",";
    
    # Initialize CSV, file handles, and counters
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, sep_char => $SEP, eol => "\n" });
    my ($i, $f, $out) = (0, 1, undef);
    open my $in, "<:encoding(UTF-8)", $INFILE or die "Cannot open $INFILE: $!";
    
    # Read the header row once so it can be repeated at the top of every chunk
    my $header = $csv->getline($in) or die "Cannot read header from $INFILE\n";
    
    # Main loop
    while (my $row = $csv->getline($in)) {
        if ($i % $CHUNKSIZE == 0) {
            close $out if defined $out;
            open $out, ">:encoding(UTF-8)", sprintf($FMT, $f++) or die "Cannot open output file: $!";
            $csv->print($out, $header) or die "Failed to write header: $!";
        }
        $csv->print($out, $row) or die "Failed to write row: $!";
        $i++;
    }
    
    # Clean up: close file handles
    close $out if defined $out;
    close $in;


I suppose R might be a better choice...

    library(readr)
    library(dplyr)
    library(purrr)
    data <- read_csv("input.csv")
    chunk_size <- 500
    chunks <- split(data, ceiling(seq_len(nrow(data)) / chunk_size))
    iwalk(chunks, ~ write_csv(.x, sprintf("data-%04d.csv", as.integer(.y))))


Untested but something like this in VBA should suffice:

    Dim rng as range: set rng = Sheet1.UsedRange
    Dim rows as Long: rows = rng.rows.count
    Dim cols as long: cols = rng.columns.count
    Const SIZE as long = 500
    For i = 2 to rows step SIZE
      Dim wb as workbook: set wb = workbooks.add()
      wb.sheets(1).range("A1").resize(1, cols).value = rng.resize(1).value
      wb.sheets(1).range("A2").resize(SIZE,cols).value = rng.offset(i-1).resize(SIZE).value
      Call wb.SaveAs("C:\Temp\" & i & ".xlsx")
    next


Can you confirm that said file is purely static data? Chunking a file like that is one thing. Handling one with cell references is a different animal.


I would naturally do it in Emacs Lisp for its superior excel reading abilities

    (require 'csv-mode)
    
    (let ((input-file "bigfile.xls")
          (output-dir "chunked/")
          (chunksize 1000))
      (cl-labels ((read-csv-file (filename)
                    (with-temp-buffer
                      (insert-file-contents filename)
                      (csv-mode)
                      (csv-parse-buffer t)))
                  (write-csv-file (filename data)
                    (with-temp-buffer
                      (csv-mode)
                      (dolist (row data)
                        (csv-insert-list row))
                      (write-region (point-min) (point-max) filename)))
                  (chunk-data (data chunk-size)
                    (let (result)
                      (while data
                        (push (cl-subseq data 0 (min chunk-size (length data))) result)
                        (setq data (nthcdr chunk-size data)))
                      (nreverse result))))
    
        (let* ((data (read-csv-file input-file))
               (chunks (chunk-data data chunk-size))
               (file-count 1))
          (dolist (chunk chunks)
            (let ((output-file (expand-file-name (format "chunk-%04d.csv" file-count) output-dir)))
              (write-csv-file output-file chunk)
              (setq file-count (1+ file-count))))
          (message "Processing complete. %d files created." (1- file-count)))))
    

(This is a joke. Do not use this.)


This guy even put his joke disclaimer in parens, GAH!


The C# interop API will surely work, but I wouldn't claim it's user-friendly.

I would strongly consider dumping rows into SQL, for more "natural" selection.

[1] https://learn.microsoft.com/en-us/dotnet/api/microsoft.offic...
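
Not the C# interop route, but a sketch of the "dump the rows into SQL" idea using Python and SQLite instead (file and table names are made up):

    import sqlite3
    import pandas as pd

    # Dump every row into a local SQLite database once...
    df = pd.read_excel("big.xlsx")
    con = sqlite3.connect("big.db")
    df.to_sql("rows", con, if_exists="replace", index=False)

    # ...then selecting a "chunk" is just a LIMIT/OFFSET (or any other WHERE clause).
    chunk = pd.read_sql_query("SELECT * FROM rows LIMIT 500 OFFSET 1000", con)
    chunk.to_excel("chunk_3.xlsx", index=False)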


For something in Excel that needs to be chunked, Python is good, but Go and Rust also work well for these situations. ChatGPT is a good starting point for building out some boilerplate.

For readable (non-binary) files, there are standard tools like split, awk, etc.


That's only 20 cut and paste operations? I would certainly do that manually before trying to code it. And I'm a proficient coder, and not a proficient Excel user.

If it had 100k rows, I'd be out of my depth, so I'd hit google.


I’m not at my machine to whip up an example but this is an ideal use case for pandas. You can read or stream the excel file and split it in probably <20 lines of python.


> Guys this is just an example. I'm looking for a general solution. It could be 10 million rows.

It can't be. Excel's maximum row limit is only ~1 million.


You've missed the point of that edit. For all we know it could have been 10M rows spread over 10 Excel files. The whole reason they made the edit was that people were telling them to just do the 20 copy-and-pastes by hand.


It's been well over a decade since I last dealt with Excel, but I remember you could actually query the data with SQL without opening the file, like you would with any flat-file DB, if the size is the problem. It was poorly documented, but I'd done it a few times and it worked really well. The best part was that it was simple, fast, and worked even with locked files. Otherwise I don't understand the question.


Yes, Microsoft.ACE.OLEDB.##.0
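
For reference, a sketch of that query-without-opening approach from Python via ODBC rather than OLEDB (Windows-only; assumes the Microsoft Excel ODBC driver is installed, and the path and sheet name are made up):

    import pyodbc

    conn = pyodbc.connect(
        r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
        r"DBQ=C:\data\big.xlsx;"
    )
    cursor = conn.cursor()

    # A worksheet is addressed as [SheetName$]; Excel itself never has to open the file.
    cursor.execute("SELECT * FROM [Sheet1$]")
    for row in cursor.fetchmany(5):
        print(row)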


How you split millions of rows can be very different from how you split shorter files, especially once you run into file input/output issues in different languages.

Opening and working on an Excel file with a few million rows can need a bit more RAM than anticipated, depending on its size/complexity.

The quickest way I'd start is to convert it to CSV, read the file in, and rewrite it out 500 lines at a time.
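
A minimal sketch of that convert-to-CSV-and-rewrite approach (file names are made up; assumes the sheet has already been exported to CSV):

    import csv

    chunk_size = 500
    with open("big.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)              # keep the header so every chunk gets it
        out, writer, n = None, None, 0
        for i, row in enumerate(reader):
            if i % chunk_size == 0:        # start a new output file every 500 data rows
                if out:
                    out.close()
                n += 1
                out = open(f"chunk_{n:04d}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
        if out:
            out.close()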


Not knowing what the data is like, I'd save it in a comma- or tab-separated text format. From there, it's just a few lines of bash.



xsv is great. I quite like the fork called qsv as well; it has some features that were helpful to avoid some piping I didn't want to do. There are a lot of other additions. https://github.com/jqnatividad/qsv


I don't understand what you mean by easier way, because it's always going to be a script. Unless you mean some low-code/no-code tool that was already tailor-made to do this.

If you're worried about data not fitting in memory, then stream the file. It seems like the Java API has support for this, surely other languages do too.
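
Not the Java API, but a sketch of the streaming idea in Python using openpyxl's read-only and write-only modes (file names are made up):

    from openpyxl import Workbook, load_workbook

    chunk_size = 500
    src = load_workbook("big.xlsx", read_only=True)   # streams rows instead of loading the file
    rows = src.active.iter_rows(values_only=True)
    header = next(rows)

    out, ws, n, count = None, None, 0, 0
    for row in rows:
        if count % chunk_size == 0:                   # start a new workbook every 500 data rows
            if out:
                out.save(f"chunk_{n:04d}.xlsx")
            n += 1
            out = Workbook(write_only=True)           # write-only mode also streams
            ws = out.create_sheet()
            ws.append(header)
        ws.append(row)
        count += 1
    if out:
        out.save(f"chunk_{n:04d}.xlsx")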


Does each chunk have to evaluate properly on its own? Do you have to minimize chunk size while keeping inter-cell references working within a chunk? Can you split huge chunks that have too many references by adding an intermediate chunk with stage-based value propagation?

This turns into a massive graph theory nightmare problem if there are lots of references going on


High level concept:

Xlsx or xls?

If it’s xlsx, stream the file, chunking 500 rows into each new file, watching for ref A1, then injecting that into previous files.

The xlsx format is a zip file of xml, not so bad to work with once you start.

I should note, this works best if you don’t care about the order of the rows. They technically aren’t required to be in order (I think).
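
To make the "zip file of XML" point concrete, a quick peek inside the container (a sketch; the file name is made up, and sheet1.xml is just the usual name Excel writes):

    import zipfile

    # An .xlsx is a plain zip archive; row data lives under xl/worksheets/
    # and shared text values in xl/sharedStrings.xml.
    with zipfile.ZipFile("big.xlsx") as z:
        for name in z.namelist():
            print(name)
        print(z.read("xl/worksheets/sheet1.xml")[:200])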



Seems like a 30 second LLM prompt for a Python script.


With VBA? VBA has everything you need without any conversion or external libraries. Is there some reason you don't want to use it?


This is so quaint. I love it.

Not even going to ask AI. Waiting for a bash one-liner before OP reminds us they are on corporate Windows machine.


Then a PowerShell one-liner would do it.


A one-liner being a file of any size without ASCII codes 10 or 13 in it?


VBA macro in the original excel file.


Can wall clock time be proportional to file size or is a parallelized approach needed?


Don't put it in Excel


It's Excel, so just write a macro?


Why would you want to break it up?

If the input is not a spreadsheet, use a tool appropriate to the scale of the data store. Provide an API for access from Excel if end users need Excel.

If the input is an Excel file but the integrity of the data as a spreadsheet doesn't matter, the same applies.

If it is an Excel file already, Excel provides end users with access from Excel. If the Excel is likely to outgrow Excel, use a tool appropriate to the scale of the data. Provide an API for access from Excel if end users need Excel.

If "it is Excel" is your problem, breaking it up into multiple Excels is slicing the magic broom.

Good luck.


Convert to CSV. Then it's simple text processing. Finally, convert the CSVs back to Excel.


Not to be that guy, but this is the kind of task that LLMs are great at assisting with.


This is a classic xyproblem [0]. Excel is almost never the answer when you're dealing with "big" data; you're almost always better off getting the data into a CSV or DB format and working on it from your favorite scripting language.

[0] https://xyproblem.info/


Bet it took you more time to write that code than if you'd done it manually.

If you actually had to do it 'properly', there are a ton of options:

  - do it old school with a VBA macro
  - use the newer JS macro stuff
  - xlsx files are just a zip of XML, so you could do it in pretty much any language



