Sort very large text files

Why does everyone on SO feel so compelled to guess all the time? You can do multiple passes on the input: just read all the input, write it to disk, and then sort the disk file.

Neil - from the context it seemed obvious that he was trying to sort the contents of the file, not the file name, which for a single name is meaningless. I just wanted to improve the question without changing the context too much, so that it would get answers instead of downvotes over a simple mistake.

This is excellent. I wasn't aware that there was a parallel package!

I tried to use comm to diff the files generated by this, and it's giving me a warning that the files are not sorted.

You can just use sort --parallel N as of recent GNU sort versions (GNU coreutils 8.x).

This one did the trick for me; I have sort 8.x.

But if we bring the data into memory the way swap does (memory-map it), it can be read directly instead of going through an operating-system call for every read, which is orders of magnitude slower. The other advantage of MMAP is that multiple processes can share the same file in a thread-safe way. In Java, the classes for this are in the NIO package.

FileChannel is the class that holds the data. Read 8 KB of data at a time and store it in a TreeSet; repeat until roughly 40 MB has been read, then store that sorted data in a temporary file. Doing this over the whole input leaves us with 10 temporary sorted files.
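A rough sketch of that run-generation step, assuming newline-delimited, single-byte-encoded records and plain lexicographic order; the class and method names (RunGenerator, createSortedRuns) are mine for illustration, not from the post:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class RunGenerator {

        // Map the input with NIO, collect lines into a TreeSet (kept sorted as we go),
        // and flush a sorted "run" file roughly every runBytes bytes (e.g. ~40 MB).
        // Note: a TreeSet silently drops duplicate lines; use a List plus sort to keep them.
        // Note: a single FileChannel.map() call is limited to 2 GB, so a truly huge
        // input would have to be mapped in windows.
        static List<Path> createSortedRuns(Path input, long runBytes) throws IOException {
            List<Path> runs = new ArrayList<>();
            try (FileChannel ch = FileChannel.open(input, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                TreeSet<String> batch = new TreeSet<>();
                StringBuilder line = new StringBuilder();
                long batchBytes = 0;
                while (buf.hasRemaining()) {
                    char c = (char) buf.get();              // assumes single-byte characters
                    if (c == '\n') {
                        batch.add(line.toString());
                        batchBytes += line.length() + 1;
                        line.setLength(0);
                        if (batchBytes >= runBytes) {       // run is "full": flush it
                            runs.add(flush(batch));
                            batch.clear();
                            batchBytes = 0;
                        }
                    } else {
                        line.append(c);
                    }
                }
                if (line.length() > 0) batch.add(line.toString());  // last line without '\n'
                if (!batch.isEmpty()) runs.add(flush(batch));
            }
            return runs;
        }

        // Write one sorted batch to its own temporary file and return its path.
        private static Path flush(TreeSet<String> batch) throws IOException {
            Path run = Files.createTempFile("run-", ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
                for (String s : batch) {
                    w.write(s);
                    w.newLine();
                }
            }
            return run;
        }
    }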

Then apply a k-way merge. Using this approach, the program ran in 17 seconds, a 5-second gain.
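The k-way merge itself can be sketched with a priority queue holding one reader per run file; again, the names (KWayMerge, RunCursor) are illustrative, not from the post:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMerge {

        // One open run file plus the line it is currently positioned on.
        private static final class RunCursor {
            final BufferedReader reader;
            String current;
            RunCursor(BufferedReader reader, String first) {
                this.reader = reader;
                this.current = first;
            }
        }

        // Merge already-sorted run files into one sorted output file. Only one line
        // per run is held in memory, so the footprint stays tiny regardless of file size.
        static void merge(List<Path> runs, Path output) throws IOException {
            PriorityQueue<RunCursor> heap =
                    new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                String first = r.readLine();
                if (first != null) heap.add(new RunCursor(r, first));
                else r.close();                              // empty run: nothing to merge
            }
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                while (!heap.isEmpty()) {
                    RunCursor smallest = heap.poll();        // run with the smallest current line
                    out.write(smallest.current);
                    out.newLine();
                    String next = smallest.reader.readLine();
                    if (next != null) {                      // advance this run and re-queue it
                        smallest.current = next;
                        heap.add(smallest);
                    } else {
                        smallest.reader.close();             // this run is exhausted
                    }
                }
            }
        }
    }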

Doughnut Destroyer wrote: Technically you should be able to sort all of them. To expand on the original info I was given and on my own testing: it seems that the issues are connected specifically to filtering, NOT sorting. The user is trying to look through VIN numbers and has noted that the majority of the entries are non-recurring. I was able to get it to work, albeit a bit slowly, but with so many different options for the filtering, could that cause crashes?

SullyTech wrote: Hi, see this link for the full constraints of Excel. - Sully

Thanks for the info Sully, it is appreciated. The data being looked through by our client is database dumps that they are trying to pull specific types of info from.

While it's probably not the best way to handle things, this is how 'they' specifically wanted it done. Sully, the information is being given to them by another client, who pulled the database info and handed it off to them in the Excel file.

Gerard wrote: If it is a data dump from a DB then I presume that there are no formulae.

Big thanks to everyone who provided info and advice, it is very much appreciated!

Points: We process thousands of flat files in a day, concurrently. Memory constraint is a major issue. We use a thread for each file process. We don't sort by columns; each line (record) in the file is treated as one column. We cannot use any database system, no matter how lightweight it is.

Is there a reason you can't use a database system? DBs are made for scenarios like this because they are so efficient at sorting through large amounts of data.

Erika: how is introducing a lightweight, not-installed database different from introducing a custom-written program that does whatever you want but is not as well tested?

Both are "changing the system", technically. Try feeding this to an executive who doesn't know sh!t. If you can sell it to him, you are my mentor! Email him the link to this thread :D - keyboardP.

From my research, what I understood is that if you have records in a file and you read a chunk at a time, sort that chunk, and put the sorted version in a temp file, you end up with, say, 10 sorted temp files.

Then read two of those files sequentially, merge them into another, now larger, sorted file, and delete the two that were just read. Continue until you have one file. Now, say you have 10 million records in a file and you read a chunk at a time: how many temp files get created, and how much time is it going to cost to get the final version?
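For the pairwise merging described in that comment, a minimal sketch of my own (assuming plain text lines and lexicographic order, with illustrative names) that merges two sorted temp files and deletes the inputs:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TwoWayMerge {

        // Merge two already-sorted text files into a new sorted temp file,
        // then delete the two inputs, as described in the comment above.
        static Path mergePair(Path a, Path b) throws IOException {
            Path merged = Files.createTempFile("merge-", ".txt");
            try (BufferedReader ra = Files.newBufferedReader(a, StandardCharsets.UTF_8);
                 BufferedReader rb = Files.newBufferedReader(b, StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(merged, StandardCharsets.UTF_8)) {
                String la = ra.readLine();
                String lb = rb.readLine();
                while (la != null && lb != null) {           // emit the smaller head each time
                    if (la.compareTo(lb) <= 0) { out.write(la); out.newLine(); la = ra.readLine(); }
                    else                       { out.write(lb); out.newLine(); lb = rb.readLine(); }
                }
                for (; la != null; la = ra.readLine()) { out.write(la); out.newLine(); }   // drain a
                for (; lb != null; lb = rb.readLine()) { out.write(lb); out.newLine(); }   // drain b
            }
            Files.delete(a);
            Files.delete(b);
            return merged;
        }
    }

Repeatedly applying mergePair until one file remains gives the classic two-way external merge; the N-way merge discussed below avoids the extra passes.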

An external sort is always slower compared to an in-memory sort, but you are no longer limited by your RAM. It does an external sort where all the individual sort operations can happen on many machines in parallel.

Erika: When you merge the sorted, smaller files, you can have more than two open; it's just slightly more straightforward to describe the algorithm using only two temp files. But if you need to sort a file that's larger than the available memory, you'll eventually have to do it that way anyway, and the merging operation is relatively fast: all it needs to do is keep N file pointers open and find the lowest of the N "next records" to know what to emit next.

I guess the critical piece of tuning is choosing how many records to keep in each temporary file.

You do the merging by taking the two smallest temporary files and merging them into one larger temporary file. I would like to explain this in my own words (it differs on point 3, the merging):

1. Read the file sequentially, processing N records at a time in memory (N is arbitrary, depending on your memory constraint and on the number T of temporary files that you want).
2. Sort the N records in memory and write them to a temp file. Loop until you have written all T temp files.

Advantages: the memory consumption is as low as you want, and you only do double the disk accesses compared to an everything-in-memory policy. Not bad!

A concrete example:

3. Choose a number of temp files, read and sort the corresponding share of the records at a time, and drop each sorted chunk in its own temp file.
4. Open all the temp files and read the first record of each into memory.
5. Compare these first records, write the smallest to the output, and advance that temp file.
6. Loop on step 5, one million times.

Erika: Well, it is an example, so that we get the idea.
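To tie the numbered steps above together, here is a hedged driver sketch reusing the hypothetical RunGenerator and KWayMerge helpers sketched earlier; nothing in it comes from the original answer, and the 40 MB run size is only an example:

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class ExternalSortDriver {
        public static void main(String[] args) throws Exception {
            Path input  = Paths.get(args[0]);    // large unsorted text file
            Path output = Paths.get(args[1]);    // where the sorted result should go

            // Steps 1-3: produce sorted runs small enough to sort in memory (~40 MB each here).
            List<Path> runs = RunGenerator.createSortedRuns(input, 40L * 1024 * 1024);

            // Steps 4-6: open every run and k-way merge them into the final sorted file.
            KWayMerge.merge(runs, output);
        }
    }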


