How to Sort a list of strings and find the 1000 most common values in java

In java (either using external libraries or not) I need to take a list of approximately 500,000 values and find the most frequently occurring (mode) 1000. Doing my best to keep the complexity to a minimum.

What I've tried so far, make a hash, but I can't because it would have to be backwards key=count value =string, otherwise when getting the top 1000, my complexity will be garbage. and the backwards way doesn't really work great because I would be having a terrible complexity for insertion as I search for where my string is to be able to remove it and insert it one higher...

I've tried using a binary search tree, but that had the same issue of what the data would be for sorting, either on the count or the string. If it's on the string then getting the count for the top 1000 is bad, and vice versa insertion is bad.

I could sort the list first (by string) and then iterate over the list and keep a count until it changes strings. but what data structure should I use to keep track of the top 1000?

Thanks

Jon Skeet
people
quotationmark

I'd separate this into three phases:

  • Count word occurrences (e.g. by using a HashMap<String, Integer>)
  • Sort the results (e.g. by converting the map into a list of entries and ordering by value descending)
  • Output the top 1000 entries of the sorted results

The sorting will be slow if the counts are small (e.g. if you've actually got 500,000 separate words) but if you're expecting lots of duplicate words, it should be fine.

people

See more on this question at Stackoverflow