
Ask HN: What is the best way to get slice of a bigdata set present in txt files? - sunilkumarc
I have 20 .txt files, each containing from a few hundred thousand to a couple of million records. All these files are interdependent. For example, take person.txt and address.txt: there is a person_id field in person.txt, and all the corresponding addresses can be found in address.txt, where person_id is the foreign key.

My question is: what is the best way to generate a slice of the whole data set?

For example, let's say person_id runs from 1 to 1 million in person.txt, and I want to get all the related data in all the files for person_id = 1 - 10. What would be the best approach?

Finally, I want to generate 20 new .txt files which contain the data for just those 10 people.

Two approaches I can think of are:

1. Load the data from all the .txt files into different database tables, then write queries that join the related tables to pull out the slice.

2. Write a Python script that uses tools like grep, cut, etc. to get the data slice. For example, first run grep on person.txt to get the records for the wanted person_ids, then run grep on the other files (using person_id if that is the foreign key in the other file, or some other field taken from the result of the previous grep).

Approach 1 has the advantage of database indexing, which makes reads from the tables faster.

On the other hand, a Python script avoids the overhead of creating a database, loading all the data into it, and joining.

I'm not sure which is the best way to get better execution time for my requirement. Any help is appreciated. Thanks.
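
To make approach 2 concrete, here is a rough sketch of the kind of script I have in mind (the tab delimiter and the column positions of person_id are just placeholders, not the real layout):

    # Sketch of approach 2, assuming tab-separated files and that
    # person_id is the first column of person.txt and the second
    # column of address.txt (both positions are guesses).
    wanted = {str(i) for i in range(1, 11)}   # person_id 1-10

    with open("person.txt") as src, open("person_slice.txt", "w") as out:
        for line in src:
            if line.split("\t", 1)[0] in wanted:
                out.write(line)

    with open("address.txt") as src, open("address_slice.txt", "w") as out:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1] in wanted:
                out.write(line)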
======
unoti
I’d reach for SQLite. It’s a great and fast way to do the joins. It would
also be a much better way to transport this data than a collection of
interrelated text files. If the data will be transmitted and refreshed
periodically, this makes for a much more scalable, extensible, and smooth
process.
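
A rough sketch of what that could look like with Python's built-in sqlite3
module (the file names, column layouts, and tab delimiter are assumptions):

    import csv
    import sqlite3

    conn = sqlite3.connect("people.db")
    conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE address (person_id INTEGER, street TEXT, city TEXT)")

    # Bulk-load the text files (tab-separated here, which is a guess).
    with open("person.txt") as f:
        conn.executemany("INSERT INTO person VALUES (?, ?)",
                         csv.reader(f, delimiter="\t"))
    with open("address.txt") as f:
        conn.executemany("INSERT INTO address VALUES (?, ?, ?)",
                         csv.reader(f, delimiter="\t"))
    conn.execute("CREATE INDEX idx_address_person ON address(person_id)")
    conn.commit()

    # Pull the slice for person_id 1-10 with a single join.
    rows = conn.execute("""
        SELECT p.person_id, p.name, a.street, a.city
        FROM person p JOIN address a ON a.person_id = p.person_id
        WHERE p.person_id BETWEEN 1 AND 10
    """).fetchall()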

------
PaulHoule
It depends on your case.

I can say that pandas is amazingly good and fast at joining; if your data is
really huge, you can parallelize/segment it with Dask.
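
A pandas sketch of that (the file names, tab separator, header row, and
column names are assumptions):

    import pandas as pd

    person = pd.read_csv("person.txt", sep="\t")
    address = pd.read_csv("address.txt", sep="\t")

    subset = person[person["person_id"].between(1, 10)]
    merged = subset.merge(address, on="person_id", how="left")

    subset.to_csv("person_slice.txt", sep="\t", index=False)
    address[address["person_id"].isin(subset["person_id"])].to_csv(
        "address_slice.txt", sep="\t", index=False)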

If the joining is relatively simple, it will probably be quicker to process
the .txt files directly, and the script won't be hard to write.

If the complexity is high, you might want to try SQLite.

------
fiddlerwoaroof
Spark makes processing data in files really easy: at work I’ve been using this
to read in a bunch of events and generate various metrics from them.
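
A rough PySpark sketch of that pattern (the paths, tab separator, header
row, and output directory are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("slice").getOrCreate()

    person = spark.read.csv("person.txt", sep="\t", header=True, inferSchema=True)
    address = spark.read.csv("address.txt", sep="\t", header=True, inferSchema=True)

    subset = person.filter(person.person_id.between(1, 10))
    subset.join(address, "person_id", "left") \
          .write.csv("slice_out", sep="\t", header=True)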

