You.Can’t.Parse.HTML.with.Regex
Well, I’ve been away for ages, haven’t I? I’ve been a little busy, and truth be told, the well of inspiration has been a little dry for a while now. Anyhoo, I decided that I needed to *just start* writing again and the ideas would begin to flow. Today’s entry is about humour. In fact, it is downright hilarious.
In programming circles, parsing HTML with Regular Expressions is considered a bad idea. “Exactly how bad?”, you ask. Bad enough that one user on Stack Overflow says it’ll bring about the end of the world. Or something like that. You can’t really make sense of the crazy in the later bits!
“You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will ins tantly transport a programmer’s consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HT ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi ght he com̡e̶s, ̕h̵i s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liq uid pain, the song of re̸gular expression parsing will exti nguish the voices of mortal man from the sp here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t he f
inal snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ich or permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s͎a̧͈͖r̽̾̈́͒͑en ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ”
The rant had me literally rolling on the ground convulsing in laughter. Good Stuff. Thank you Jeff Atwood, for showing me this rant!
P.S: I have some posts lined up, so don’t go deleting me from your RSS just yet. From now on, I’ll try to keep it one post per week!
Revenge
[Note: This post is not about C. It's not about Linux. This post is related to Front End Engineering. If you're not interested, you may skip this post.]
Yesterday I was chatting with one of my friends. Like all the other people who don’t understand the difference between Java and JavaScript, he was cursing JavaScript for how it behaves in browsers. To answer him, I told him a story. A story made up on the spot, based on known facts. It goes as follows:
“A long, long time ago, when the web, ‘the internet’, was young, browsers and web page authors knew only one language to communicate in: hypertext markup language (HTML). As it was new, neither authors nor browsers were fluent in that language. So browsers would just read whatever was written in the .html page and somehow try to render it.”
“After a few years, a community called the W3C came up with standards for HTML and said, ‘Whatever you are doing is crap. Do it this way.’ Browsers agreed. They released newer versions with these standards enabled. The problem was that browsers now tried to render every page on the assumption that its HTML followed the standards.”
“Obviously enough, all the older pages broke because they were not standardized. Now the funny thing is that people blamed the new browsers instead of the web page authors. To survive in such conditions, browsers introduced a technique called ‘quirks mode’, which renders a page the way older browser versions did if it is a non-standard page. A nifty trick that does the job. Everyone was happy now.”
“A few years later, pages started using JavaScript and CSS. Before any standards could cover these new things, browsers adopted them in whatever way they thought should work. Some browsers supported the same functions in different ways.”
“Now, after all these years, HTML, JS, and CSS are all standardized, but browsers are still not adopting these standards. The main difference between the two cases is that people now blame the language and the programmers, and not the browsers that are actually causing these things to happen.”
This, I say, is the browsers’ revenge on programmers.
- x -
Note: OK. Agreed. JS is not perfect. But as Douglas Crockford says: if you neglect the bad parts of it, saying they do not exist, then all that remains is The Good Parts.
Wammu – Phone manager for Linux
Wammu seems to have all the features of MyPhoneExplorer and is quite good. I especially liked the phone connect guide. The UI needs a bit of tweaking, but overall it seems to be quite a cool app. I had blogged about Multisync earlier and I think Wammu is better.
Ubuntu users can click here to install it. It’s also available for Windows. You can find screens here. The list of supported phones is available here.
Console Junkie: Conky Makes Your Desktop Awesome!
What is conky?
Conky is a light-weight system monitor, which can display any information you want on the desktop. You can get it here, or you can simply install it with:
sudo apt-get install conky
If you want my conky setup though, you should compile it from source. I’ve explained why further in the post, so read through the whole thing before you go setting up your own conky. Here’s what my desktop looks like:
As you can see, I show music stats on the right hand side. I use mpd for music. Now, conky has built-in mpd support, which means that using conky's own mpd variables is faster and lighter on resources. However, these variables are disabled by default in the version available in the Ubuntu repos. You won’t be able to use them if you do a sudo apt-get install conky. Hence the compile-from-source bit. If you are not using mpd for music, you might as well do a sudo apt-get install. If you are compiling from source, this is an excellent guide.
So that’s done. Next you will need to download my conky config files and other scripts required for the setup. You can download them from here. Extract them and rename the folder as scripts. I keep all my scripts at /home/vedang/Source/scripts/, and this path is hardcoded into the scripts everywhere. Please search for the string and change it appropriately.
Treat Warnings as Errors
If you are reading this blog and have even an iota of a sense of humour, you must have heard (and liked) this joke:
——
A man is smoking a cigarette and blowing smoke rings into the air. His girlfriend becomes irritated with the smoke and says, “Can’t you see the warning on the cigarette pack? Smoking is hazardous to your health!”
To which the man replies, “I am a programmer. We don’t worry about warnings; we only worry about errors.”
——-
Of course, I laughed my ass off when I first heard the joke. Awesome it was! But then, the last month totally changed my perception. (Probably because I faced the side effects of ignoring warnings.)
So, the ARM architecture is very specific about alignment. It hates misaligned access (it's a side effect of being an arch for embedded systems). Here is an example of unaligned access:
1 short short_arr[4];
2 long *long_ptr = short_arr;
3 blah = *long_ptr;
Since the data type of short_arr is "short", gcc forces only 16-bit alignment on the array (on 32-bit systems). But a long is probably 32 bits wide there, so dereferencing long_ptr on line 3 may be a misaligned access. Although i386 would work perfectly fine (but require one extra memory cycle), ARM will just give garbage data (if misaligned).
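One common way to sidestep this kind of access is to copy the bytes out with memcpy instead of dereferencing a cast pointer. What follows is only a minimal sketch: the helper name read_u32_unaligned and its parameters are hypothetical and are not taken from any package mentioned in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value that starts at any byte offset
 * inside a buffer. memcpy places no alignment requirement on its source,
 * so the read is well defined even when buf + offset is not 4-byte aligned. */
static uint32_t read_u32_unaligned(const unsigned char *buf, size_t len,
                                   size_t offset)
{
    uint32_t value;

    /* Refuse to read past the end of the buffer. */
    assert(offset <= len && len - offset >= sizeof value);

    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

A caller would pass the raw byte buffer and an offset, for example read_u32_unaligned(packet, packet_len, 1), where packet and packet_len are again just placeholder names. The value comes back in the machine's native byte order.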
And many many many software packages are filled with such errors. Just to give you an example:
# cat /proc/cpu/alignment
User: 1265555
(this file is present only on the ARM architecture because of ARM's peculiar alignment requirements)
See the number of alignment errors?
Packages like mhash (which calculates hashes) build fine but fail at run time, which makes such errors difficult to track down.
So, always make it a point to compile your C programs with the "-Werror" option. It treats warnings as errors. It is *very* important for your programs to work on just about any platform. Please pay attention to warnings, because warnings are there FOR A REASON. gcc developers aren’t stupid!
(Although sometimes you might be sure that, even if gcc warns about misalignment, the access is aligned, e.g. when you typecast an IP struct into a MAC struct. These are the *only* times you can let go of the -Werror option.)
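If you are relying on "I am sure this is aligned", that belief can be checked at run time instead of taken on faith. This is a rough sketch only: is_aligned is a hypothetical helper name, and it assumes a C11 compiler (for _Alignof) and a platform where converting a pointer to uintptr_t gives its plain address, which is the usual case on Linux.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns non-zero if ptr is a multiple of `alignment`
 * bytes. Assumes the pointer-to-integer conversion yields the plain address. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

Before a cast you believe is safe, something like assert(is_aligned(p, _Alignof(long))) turns a silent garbage read into a loud failure during testing. Here p stands for whatever pointer you are about to cast.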
But, until this point sinks into the mind of every developer, there’s no option but to live in this mess.
