java - Extract only the text from entity-encoded HTML inside an XML element
I'm working on an Android application that parses XML, and I have a problem. The page I'm going to parse has a <description> element, and inside it there is entity-encoded HTML that I don't want. This is the structure:
<description><img src="http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg" alt="post img " style="float:left;margin-right:10px"/> lorem ipsum...</description>
What I want is only the lorem ipsum... part, with none of the encoded HTML inside the <description> tag. This is the doInBackground part of my AsyncTask:
@Override
protected Void doInBackground(Void... args) {
    Element e = null;
    XMLParser parser = new XMLParser();
    String xml = parser.getXmlFromUrl(URL); // getting XML
    Document doc = parser.getDomElement(xml); // getting DOM element
    NodeList nl = doc.getElementsByTagName(KEY_ITEM);

    // looping through all item nodes <item>
    for (int i = 0; i < nl.getLength(); i++) {
        // creating new HashMap
        HashMap<String, String> map = new HashMap<String, String>();
        e = (Element) nl.item(i);

        // adding each child node to HashMap key => value
        map.put(KEY_TITLE, parser.getValue(e, KEY_TITLE));
        map.put(KEY_DESC, parser.getValue(e, KEY_DESC));
        map.put(KEY_DATE, parser.getValue(e, KEY_DATE));
        map.put(KEY_LINK, parser.getValue(e, KEY_LINK));

        String descriptionElementContent = KEY_DESC;
        textOnly = removeEntityEncodedHtmlTags(descriptionElementContent);

        // adding HashList to ArrayList
        menuItems.add(map);
    }

    System.out.println("prova3");

    for (int c = 0; c < nl.getLength(); c++) {
        e = (Element) nl.item(c);
        titoli[c] = parser.getValue(e, KEY_TITLE); // name child value
        descrizioni[c] = parser.getValue(e, textOnly);
        date[c] = parser.getValue(e, KEY_DATE);
        links[c] = parser.getValue(e, KEY_LINK);
    }
    return null;
}

public String removeEntityEncodedHtmlTags(String rawString) {
    Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
    return tagMatcher.replaceAll("").trim();
}

@Override
protected void onPostExecute(Void s) {
    adapter = new SimpleAdapter(getActivity(), menuItems,
            R.layout.post_list_item,
            new String[] { KEY_TITLE, textOnly, KEY_DATE },
            new int[] { R.id.title, R.id.description, R.id.date });
    System.out.println("prova5");
    listView.setAdapter(adapter);
    pDialog.dismiss();
}
It's the first time I've parsed XML and I don't know how this can be done.
This is the XMLParser class:
public class XMLParser {

    // constructor
    public XMLParser() {
    }

    /**
     * Getting XML from URL making HTTP request
     *
     * @param url string
     */
    public String getXmlFromUrl(String url) {
        String xml = null;
        try {
            // DefaultHttpClient
            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpPost httpPost = new HttpPost(url);
            HttpResponse httpResponse = httpClient.execute(httpPost);
            HttpEntity httpEntity = httpResponse.getEntity();
            xml = EntityUtils.toString(httpEntity);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // return XML
        return xml;
    }

    /**
     * Getting XML DOM element
     *
     * @param xml string
     */
    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        return doc;
    }

    /**
     * Getting node value
     *
     * @param elem element
     */
    public final String getElementValue(Node elem) {
        Node child;
        if (elem != null) {
            if (elem.hasChildNodes()) {
                for (child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        return child.getNodeValue();
                    }
                }
            }
        }
        return "";
    }

    /**
     * Getting node value
     *
     * @param item element node
     * @param str key string
     */
    public String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return this.getElementValue(n.item(0));
    }
}
I edited the question with the new doInBackground method. Nothing changed.
If you want to discard all of the entity-encoded HTML found within the description element, you can use a regular expression to find the encoded HTML tags and replace them with an empty string, then trim the resulting string to get rid of unwanted leading and trailing spaces.
You can do this using the String.replaceAll(String regex, String replacement) method, but if you're going to do it more than once it's worth creating a Pattern object once and reusing it each time. This avoids the Java runtime having to compile the same regular expression more than once.
I've tested this code and it works against the example XML in the question:
private static final Pattern ENTITY_ENCODED_HTML_TAG = Pattern.compile("<.*?>");

public static void main(String[] args) {
    String descriptionElementContent = "<img src=\"http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg\" alt=\"post img \" style=\"float:left;margin-right:10px\"/> lorem ipsum... <br />";
    String textOnly = removeEntityEncodedHtmlTags(descriptionElementContent);
    System.out.println(textOnly);
}

public static String removeEntityEncodedHtmlTags(String rawString) {
    Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
    return tagMatcher.replaceAll("").trim();
}
Simply replace the main method in the above code with your own loop through the description elements.
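As a rough illustration of what such a loop might look like, here is a minimal sketch that runs on a plain JVM. The class name DescriptionTextExtractor, the method names, and the sample feed are hypothetical and not taken from the question. It assumes each <item> has at most one <description>, and it reads the value with getTextContent() rather than the question's getValue helper. Note that the string passed to the cleaning method is the element's content, not the tag name held in a constant such as KEY_DESC.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DescriptionTextExtractor {

    // Same pattern as in the answer: "<", then as little as possible, then ">".
    private static final Pattern HTML_TAG = Pattern.compile("<.*?>");

    static String removeHtmlTags(String rawString) {
        return HTML_TAG.matcher(rawString).replaceAll("").trim();
    }

    // Parses the feed and returns the tag-free text of each item's <description>.
    static List<String> extractDescriptions(String xml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList descriptions = item.getElementsByTagName("description");
            if (descriptions.getLength() == 0) {
                result.add(""); // item without a description
                continue;
            }
            // The parser has already decoded &lt; and &gt; to < and > here.
            String raw = descriptions.item(0).getTextContent();
            result.add(removeHtmlTags(raw));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical two-item feed with entity-encoded HTML in the descriptions.
        String xml = "<rss><channel>"
                + "<item><title>First</title>"
                + "<description>&lt;img src=\"a.jpg\" alt=\"post img\"/&gt; lorem ipsum...</description></item>"
                + "<item><title>Second</title>"
                + "<description>dolor &lt;b&gt;sit&lt;/b&gt; amet &lt;br /&gt;</description></item>"
                + "</channel></rss>";
        for (String text : extractDescriptions(xml)) {
            System.out.println(text);
        }
    }
}
```

In the question's code, the equivalent step would probably be to store the cleaned value in the map under KEY_DESC inside the first loop, so the adapter can keep using KEY_DESC as its key.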
The regex pattern <.*?> says "match the sequence < and any content until the first occurrence of >". This pattern is used to instantiate a single static Pattern object, which can be used (over and over again) to create a Matcher object for each raw string you pass to the method. The Matcher.replaceAll method replaces every match found (in the raw string) with an empty string, which gets rid of the tags and leaves the text only.