I'm trying to scrape data from the website Squawka.com. For example, to scrape data from http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match I use this code:
HttpClient client = new DefaultHttpClient();
String url = "http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match";
String urlEncode = "http://www.squawka.com/teams/chelsea/stats" + URLEncoder.encode("#", "UTF-8") + "performance-score" + URLEncoder.encode("#", "UTF-8") + "english-barclays-premier-league"+ URLEncoder.encode("#", "UTF-8") +"season-2014/2015"+ URLEncoder.encode("#", "UTF-8") +"126"+ URLEncoder.encode("#", "UTF-8") +"all-matches"+ URLEncoder.encode("#", "UTF-8") +"1-7"+ URLEncoder.encode("#", "UTF-8") +"by-match";
HttpGet get = new HttpGet(urlEncode);
HttpResponse response = client.execute(get);
HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
System.out.println(content);
As you can see, the hash sign # is illegal (which gave me the IllegalArgumentException).
So I decided to encode the URL using URLEncoder, which gives my second variable, urlEncode. But with this variable, a different URL is requested, which returns other data.
So my question is: how should I change my code in order to get the data from the right URL (the String variable url)?
Thanks in advance.
Everything beyond the # is the fragment identifier. It's not sent to the server as part of the request; in this case it would be used by the JavaScript on the page to perform extra filtering.
When fetching the page programmatically, you just need to fetch http://www.squawka.com/teams/chelsea/stats. That will get the same data the browser receives for the original link, but you'll then need to work out what the JavaScript would have done with the fragment identifier in order to get to the right data within the page (possibly making more requests).
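As a rough sketch of that first step, the fragment can be cut off with plain string handling before the URL is handed to HttpGet. The class and method names here (FragmentSplit, withoutFragment, fragmentParts) are made up for illustration, and what each piece of the fragment means to the page's JavaScript is an assumption you would have to verify yourself:

```java
import java.util.Arrays;

class FragmentSplit {
    // Returns the part of the URL before the first '#', i.e. what is
    // actually sent to the server in the request.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns the pieces of the fragment, split on '#'. The page's
    // JavaScript presumably uses these as filters (an assumption).
    static String[] fragmentParts(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? new String[0] : url.substring(hash + 1).split("#");
    }

    public static void main(String[] args) {
        String url = "http://www.squawka.com/teams/chelsea/stats"
                + "#performance-score#english-barclays-premier-league"
                + "#season-2014/2015#126#all-matches#1-7#by-match";

        // This string is safe to pass to new HttpGet(...).
        System.out.println(withoutFragment(url));

        // These values never reach the server; they are only used client-side.
        System.out.println(Arrays.toString(fragmentParts(url)));
    }
}
```

The result of withoutFragment(url) can be passed to new HttpGet(...) without the IllegalArgumentException, since it no longer contains any #. The fragment pieces are kept only so you can compare them against whatever extra requests the page makes.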