Ruby's HTTP Client improperly parsing certain URLs
It is quite possible that there is already an answer to the following question; if so, I was unable to recognize it.
Here's the thing: I am making a Ruby program that sweeps a dictionary's list of entries. I need it because I want to sweep each entry in search of specific words, but that's beside the point. The problem is that the program has trouble downloading data from encoded links, which has never occurred before.
By encoded, I mean the encoding that replaces non-ASCII characters etc., so that a link like this: http://www.dict.cc/deutsch-englisch/a+[auch+a]+[buchstabe].html looks like this: http://www.dict.cc/deutsch-englisch/a+%5bauch+a%5d+%5bbuchstabe%5d.html
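To make the encoding concrete, here is a minimal sketch of how the square brackets in that path map to `%5b` and `%5d` and back. The helper names `percent_encode_brackets` and `percent_decode_brackets` are made up for this illustration and are not part of any library; they only handle the two bracket characters, not general percent-encoding.

```ruby
# Hypothetical helpers for illustration only: they handle just "[" and "]".

# Replace each bracket with "%" followed by its two-digit hex byte value.
def percent_encode_brackets(str)
  str.gsub(/[\[\]]/) { |ch| format('%%%02x', ch.ord) }
end

# Turn "%5b" / "%5d" (either case) back into the bracket characters.
def percent_decode_brackets(str)
  str.gsub(/%5[bd]/i) { |match| match[1, 2].hex.chr }
end

plain   = 'a+[auch+a]+[buchstabe].html'
encoded = percent_encode_brackets(plain)

puts encoded                           # a+%5bauch+a%5d+%5bbuchstabe%5d.html
puts percent_decode_brackets(encoded)  # a+[auch+a]+[buchstabe].html
```

Note that the `+` signs are left untouched in both directions, which matches the links shown in the question.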
The funny thing is that while the encoded link above does not work, other links do work, for instance: /deutsch-englisch/a+an+b+anpassen.html
I have tested random links, and they work, and the regex matches what it is supposed to match.
Here's the function I am using:
    def getdataoverhttpget(link, proxy = nil)
      link = URI.unescape(link) # added
      http = HTTPClient.new(:agent_name => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/25.0')
      http.proxy = proxy if proxy
      r = http.get(link)
      raise r.status.to_s if r.status != 200
      return r.body
    end
This worked fine until now. It has been suggested to me that the URLs might be escaped by the HTTP client, so I added the unescape call. All I got in return was an empty string instead of the information that the program generates for missing data (= a failed match using the regex). However, using URI.escape makes no change, so that might be the case. I have no idea what else I can try.
Also, all the strings in the program are in UTF-8 (no BOM).
Try calling URI.parse, like so:
    http = HTTPClient.new(:agent_name => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/25.0')
    http.proxy = proxy if proxy
    r = http.get(URI.parse(link))
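As a quick way to see what URI.parse does with the problematic link, the following sketch uses only Ruby's standard `uri` library and makes no network request. The wrapper `safe_parse` is a hypothetical helper added for this example; it returns `nil` when the string cannot be parsed, which may be handy when sweeping many links.

```ruby
require 'uri'

# Hypothetical helper: parse a link, returning nil instead of raising
# when the string is not a valid URI.
def safe_parse(link)
  URI.parse(link)
rescue URI::InvalidURIError
  nil
end

link = 'http://www.dict.cc/deutsch-englisch/a+%5bauch+a%5d+%5bbuchstabe%5d.html'
uri  = safe_parse(link)

puts uri.host          # www.dict.cc
puts uri.path          # /deutsch-englisch/a+%5bauch+a%5d+%5bbuchstabe%5d.html
puts uri.to_s == link  # true: the percent-encoded path is kept as-is
```

The resulting URI object keeps the percent-encoded path unchanged, so it can be handed to `http.get` as in the snippet above instead of the raw string.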