java - Extract only the text from entity-encoded HTML inside an XML element

I'm working on an XML-parsing Android application and have a problem. The page I'm going to parse has a <description> element in which entity-encoded HTML appears that I don't want. The structure is:

<description>&lt;img src="http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg" alt="post img " style="float:left;margin-right:10px"/&gt; lorem ipsum...</description>

What I want is the lorem ipsum... part, and none of the encoded HTML inside the <description> tag. This is the doInBackground part of my AsyncTask:

@Override
protected Void doInBackground(Void... args) {
    Element e = null;
    XMLParser parser = new XMLParser();
    String xml = parser.getXmlFromUrl(URL); // getting XML
    Document doc = parser.getDomElement(xml); // getting DOM element
    NodeList nl = doc.getElementsByTagName(KEY_ITEM);

    // looping through all item nodes <item>
    for (int i = 0; i < nl.getLength(); i++) {
        // creating new HashMap
        HashMap<String, String> map = new HashMap<String, String>();
        e = (Element) nl.item(i);
        // adding each child node to HashMap key => value
        map.put(KEY_TITLE, parser.getValue(e, KEY_TITLE));
        map.put(KEY_DESC, parser.getValue(e, KEY_DESC));
        map.put(KEY_DATE, parser.getValue(e, KEY_DATE));
        map.put(KEY_LINK, parser.getValue(e, KEY_LINK));

        String descriptionElementContent = KEY_DESC;
        textOnly = removeEntityEncodedHtmlTags(descriptionElementContent);

        // adding HashMap to ArrayList
        menuItems.add(map);
    }
    System.out.println("prova3");
    for (int c = 0; c < nl.getLength(); c++) {
        e = (Element) nl.item(c);
        titoli[c] = parser.getValue(e, KEY_TITLE); // name child value
        descrizioni[c] = parser.getValue(e, textOnly);
        date[c] = parser.getValue(e, KEY_DATE);
        links[c] = parser.getValue(e, KEY_LINK);
    }
    return null;
}

public String removeEntityEncodedHtmlTags(String rawString) {
    Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
    return tagMatcher.replaceAll("").trim();
}

@Override
protected void onPostExecute(Void s) {
    adapter = new SimpleAdapter(getActivity(), menuItems, R.layout.post_list_item,
            new String[] { KEY_TITLE, textOnly, KEY_DATE },
            new int[] { R.id.title, R.id.description, R.id.date });
    System.out.println("prova5");
    listView.setAdapter(adapter);
    pDialog.dismiss();
}

It's the first time I've parsed XML, so I don't know how this can be done.

This is my XMLParser class, by the way:

public class XMLParser {

    // constructor
    public XMLParser() {
    }

    /**
     * Getting XML from a URL by making an HTTP request
     *
     * @param url string
     */
    public String getXmlFromUrl(String url) {
        String xml = null;

        try {
            // DefaultHttpClient
            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpPost httpPost = new HttpPost(url);

            HttpResponse httpResponse = httpClient.execute(httpPost);
            HttpEntity httpEntity = httpResponse.getEntity();
            xml = EntityUtils.toString(httpEntity);

        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // return XML
        return xml;
    }

    /**
     * Getting XML DOM element
     *
     * @param xml string
     */
    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {

            DocumentBuilder db = dbf.newDocumentBuilder();

            InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);

        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }

        return doc;
    }

    /**
     * Getting node value
     *
     * @param elem element
     */
    public final String getElementValue(Node elem) {
        Node child;
        if (elem != null) {
            if (elem.hasChildNodes()) {
                for (child = elem.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        return child.getNodeValue();
                    }
                }
            }
        }
        return "";
    }

    /**
     * Getting node value
     *
     * @param item element node
     * @param str key string
     */
    public String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return this.getElementValue(n.item(0));
    }
}

I edited the question with the new doInBackground method. Nothing changed.

If you want to discard all of the entity-encoded HTML found within the description element, you can use a regular expression to find the encoded HTML tags and replace them with the empty string, then trim the resulting string to get rid of unwanted leading and trailing spaces.

You can do this using the String.replaceAll(String regex, String replacement) method, but if you're going to do it more than once it's worth creating a Pattern object once and reusing it each time. This avoids the Java runtime having to compile the same regular expression more than once.
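As a rough illustration of the two options just described, the sketch below strips the encoded tags once with String.replaceAll and once with a precompiled Pattern. The class name TagStripComparison, the method names and the shortened sample string are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class TagStripComparison {

    // Compiled once and reused for every input string.
    private static final Pattern ENCODED_TAG = Pattern.compile("&lt;.*?&gt;");

    // Convenient for one-off use; String.replaceAll compiles the regex on each call.
    public static String stripWithReplaceAll(String raw) {
        return raw.replaceAll("&lt;.*?&gt;", "").trim();
    }

    // Reuses the precompiled pattern, which is preferable inside a loop.
    public static String stripWithPattern(String raw) {
        return ENCODED_TAG.matcher(raw).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Shortened sample input (an assumption, not the feed's real content).
        String raw = "&lt;img src=\"a.jpg\"/&gt; lorem ipsum... &lt;br /&gt;";
        System.out.println(stripWithReplaceAll(raw)); // prints "lorem ipsum..."
        System.out.println(stripWithPattern(raw));    // prints "lorem ipsum..."
    }
}
```

Both methods return the same text; the difference only matters when the method is called many times, as it would be once per <item> in a feed.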

I've tested this code and it works against the example XML in your question:

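import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original snippet did not include one.
public class EntityEncodedHtmlStripper {

    // Matches one entity-encoded tag, from "&lt;" up to the nearest "&gt;".
    private static final Pattern ENTITY_ENCODED_HTML_TAG =
            Pattern.compile("&lt;.*?&gt;");

    public static void main(String[] args) {
        String descriptionElementContent =
                "&lt;img src=\"http://images.website.it/thumbs/images/2014/12/16/asd_crop_upscale_q85.jpg\" alt=\"post img \" style=\"float:left;margin-right:10px\"/&gt; lorem ipsum... &lt;br /&gt;";
        String textOnly = removeEntityEncodedHtmlTags(
                descriptionElementContent);
        System.out.println(textOnly);
    }

    // Removes every encoded tag and trims leading and trailing whitespace.
    public static String removeEntityEncodedHtmlTags(String rawString) {
        Matcher tagMatcher = ENTITY_ENCODED_HTML_TAG.matcher(rawString);
        return tagMatcher.replaceAll("").trim();
    }
}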

Simply replace the main method in the above code with your own loop through the description elements.
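One way such a loop might look is sketched below, using only the standard DOM API. The class name DescriptionTextExtractor, the method extractDescriptions and the sample feed are assumptions made for this example. Note that an XML parser decodes entities, so the text it hands back for a description element normally contains a literal <img ... /> rather than &lt;img ... /&gt;; for that reason the pattern here accepts both forms, at the cost of also removing any genuine < ... > sequence in the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the question's code.
public class DescriptionTextExtractor {

    // Matches a tag in encoded (&lt;...&gt;) or already-decoded (<...>) form.
    private static final Pattern HTML_TAG =
            Pattern.compile("(?:&lt;|<).*?(?:&gt;|>)");

    // Returns the tag-free text of every <description> element in the XML.
    public static List<String> extractDescriptions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("description");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() returns the element's text with entities decoded.
                String raw = nodes.item(i).getTextContent();
                result.add(HTML_TAG.matcher(raw).replaceAll("").trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed fragment modelled on the structure in the question.
        String xml = "<rss><item><description>"
                + "&lt;img src=\"x.jpg\" alt=\"post img \"/&gt; lorem ipsum..."
                + "</description></item></rss>";
        System.out.println(extractDescriptions(xml)); // prints [lorem ipsum...]
    }
}
```

In the question's doInBackground, the equivalent step would be to pass the value returned by parser.getValue(e, KEY_DESC) to the tag-removal method and store the result in the map, rather than passing the KEY_DESC constant itself.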

The regex pattern &lt;.*?&gt; says "match the sequence &lt; and any content up to the first occurrence of &gt;". This pattern is used to instantiate a single static Pattern object, which can be used (over and over again) to create a Matcher object for each raw string you pass to the method. The Matcher.replaceAll method replaces every match found (in the raw string) with the empty string to get rid of it, leaving the text only.
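The "up to the first occurrence" behaviour comes from the reluctant quantifier .*? in the pattern. The small sketch below contrasts it with the greedy .* to show why it matters; the class name ReluctantVsGreedy and the sample string are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class ReluctantVsGreedy {

    // Reluctant: stops at the nearest "&gt;" after each "&lt;".
    public static final Pattern RELUCTANT = Pattern.compile("&lt;.*?&gt;");

    // Greedy: runs on to the last "&gt;" on the line.
    public static final Pattern GREEDY = Pattern.compile("&lt;.*&gt;");

    public static String strip(Pattern pattern, String raw) {
        return pattern.matcher(raw).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;lorem&lt;/b&gt; ipsum";
        System.out.println(strip(RELUCTANT, raw)); // prints "lorem ipsum"
        System.out.println(strip(GREEDY, raw));    // prints "ipsum"
    }
}
```

With the greedy form, the text between two tags is swallowed along with them, which is why the reluctant form is the one used for tag removal.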
