How to process 100 GB TSV and XML files?
2 points by anindha 10 months ago | 6 comments
I am trying to parse a music data file that is close to 100 GB. What app or programming language is best for handling a file like this?

Thanks!




It really depends on what you need to do with the data, but in most cases Python can handle this pretty easily. Use csv.reader (with a \t delimiter for TSV) or xml.etree.ElementTree.iterparse (for XML) to stream the file, so that you never load the whole thing into memory at once.
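
A minimal sketch of that streaming approach, using only the standard library. The helper names (iter_tsv_rows, iter_xml_records) and the "track" tag are made up for illustration, and the XML helper assumes each record is a direct child of the root element; adjust both to the real file layout.

```python
import csv
import io
import xml.etree.ElementTree as ET


def iter_tsv_rows(fileobj):
    """Yield one row (a list of strings) at a time from a tab-separated stream."""
    # QUOTE_NONE treats quote characters as ordinary data, which is an
    # assumption about the file; drop it if fields are actually quoted.
    yield from csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)


def iter_xml_records(source, tag):
    """Yield each completed <tag> element, then drop it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Assumes records are direct children of the root: clearing the
            # root discards finished records so memory use stays flat.
            root.clear()


if __name__ == "__main__":
    # Tiny in-memory stand-ins; for the real file pass
    # open(path, newline="", encoding="utf-8") or a filename instead.
    tsv = io.StringIO("artist\ttitle\nA\tx\nB\ty\n")
    for row in iter_tsv_rows(tsv):
        print(row)

    xml = io.BytesIO(b"<tracks><track><title>x</title></track></tracks>")
    for record in iter_xml_records(xml, "track"):
        print(record.findtext("title"))
```

Both helpers are generators, so you process each row or record inside the loop and never hold more than one at a time.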


You can leverage ClickHouse to process your music data. ClickHouse supports both TSV[0] and XML[1] data formats.

[0] https://clickhouse.com/docs/en/interfaces/formats#tabseparat...

[1] https://clickhouse.com/docs/en/interfaces/formats#xml



What kind of single music data file is 100 GB?

Also, how is it structured? If it's actually a tab-separated values file, consider using something like Polars or DuckDB.



For TSV, you might want to consider importing it into a SQLite database, then querying it however you please.

https://stackoverflow.com/a/35454070/5298150

You can also use Datasette and sqlite-utils for this:

https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
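
If you would rather do the import from a script, here is a rough sketch with Python's built-in csv and sqlite3 modules. The load_tsv helper, the table name, and the batch size are hypothetical; it assumes the first line is a header, that every row has the same number of fields, and that column names contain no double quotes.

```python
import csv
import itertools
import sqlite3


def load_tsv(conn, fileobj, table, batch_size=50_000):
    """Stream a TSV with a header row into SQLite; return the rows inserted."""
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)

    # Untyped columns named after the header fields (assumed free of quotes).
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    insert = f'INSERT INTO "{table}" VALUES ({placeholders})'

    total = 0
    while True:
        # Pull at most batch_size rows so memory use stays bounded.
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch, committed on success
            conn.executemany(insert, batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    import io

    conn = sqlite3.connect(":memory:")  # use a file path for real data
    sample = io.StringIO("artist\ttitle\nA\tx\nB\ty\nA\tz\n")
    print(load_tsv(conn, sample, "tracks", batch_size=2))
    print(conn.execute("SELECT COUNT(*) FROM tracks WHERE artist = ?", ("A",)).fetchone())
```

Inserting in batches inside a transaction is much faster than committing row by row, and once the data is loaded you can add indexes on the columns you query most.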



