Programming thread

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
Anyone have a good webscraping tool?
Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.

XPath isn't that hard; you can think of it as an alternative CSS syntax if that helps.

You can experiment with XPath in real time by installing a libxml package. That will give you a tool called xmllint which, among other things, will let you run XPath queries against files and print the result.

Of course, all this presumes your input isn't too soupy…
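
To make that concrete, here's a rough sketch in Python with lxml (the URL and query are just placeholders, not anything I actually scrape); something like xmllint --html --xpath '//li/a/@href' page.html gets you the same answer straight from the shell.

Python:
# rough sketch: curl a page, load it, query the bits you need with XPath (assumes the lxml package)
from urllib.request import urlopen
from lxml import etree

page = urlopen('https://example.com').read()
# the HTML parser is forgiving; swap in etree.XMLParser() if your source really is strict XML/XHTML
doc = etree.fromstring(page, etree.HTMLParser())

# e.g. pull the href of every link that sits inside a list item
for href in doc.xpath('//li/a/@href'):
    print(href)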
 
  • Winner
Reactions: SickNastyBastard

SickNastyBastard

The Alpha Troon: Order 66 Pisstroopers
True & Honest Fan
kiwifarms.net
Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.

XPath isn't that hard; you can think of it as an alternative CSS syntax if that helps.

You can experiment with XPath in real time by installing a libxml package. That will give you a tool called xmllint which, among other things, will let you run XPath queries against files and print the result.

Of course, all this presumes your input isn't too soupy…
Sweet, thanks bro.
 

Considered HARMful

kiwifarms.net
Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.
Serious question: what percentage of webpages parse as valid XML? I'd bet some coin that very few do. A lot of the "HTML" isn't even valid HTML...

Don't forget learning regular expressions to fiddle with all the things that prevent a page from loading as XML. (Unclosed tags such as IMG are a big one)
Using regular expressions to handle context-free languages is asking for trouble.
 

Kosher Dill

Potato Chips
True & Honest Fan
kiwifarms.net
Using regular expressions to handle context-free languages is asking for trouble.
Well yeah, don't try to actually handle the language that way.
When I was scraping Twitter, just nuking out the IMG elements and a few others like INPUT and SOURCE, plus unescaping some escaped characters, was enough to get the whole thing to parse as XML.

C#:
xml = Regex.Replace(xml, @"<img[^>]+>", "");
And so on.

(EDIT: and yes, this would have broken if some mongoloid had put <img> in there. I guess a * would have done fine.)
 
Last edited:

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
Serious question: what percentage of webpages parse as valid XML? I'd bet some coin that very few do. A lot of the "HTML" isn't even valid HTML...
Well, fewer than in the days when most pages were done by hand, I'd wager. If you're dealing with a professional site, they'll be running tests and such to notify them if their CMS or storefront or whatever is outputting bad HTML. But on the other hand, some XML parsers also have a "soup" mode which will have some tolerance for malformed XML, and pretty much all of them will gladly tell you where and how things go wrong if they just give up - so you can use regex, yes, to fudge the source into something more correct before loading it.
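
For instance, here's a rough sketch of that soup mode with Python's lxml (which wraps libxml2) - the markup is a made-up example with an unclosed IMG, the classic culprit.

Python:
from lxml import etree

soupy = "<div><p>hello<img src='pic.png'></p></div>"  # unclosed <img>, so not well-formed XML

# a strict parse gives up - and tells you exactly where and why
try:
    etree.fromstring(soupy)
except etree.XMLSyntaxError as err:
    print("strict parser bailed:", err)

# recover ("soup") mode shrugs and builds a tree anyway
doc = etree.fromstring(soupy, etree.XMLParser(recover=True))
print(etree.tostring(doc))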
 

SickNastyBastard

The Alpha Troon: Order 66 Pisstroopers
True & Honest Fan
kiwifarms.net
I'm starting my BERT stuff tonight. I need a lot of specific exchanges and trading of ideas to use for building a model; I'm learning as I go. I'm using BERT + TensorFlow. I'm scraping the information and creating a data set for training with the most hardcore troons to ever exist. I will be #1 troon with deep learning. Then everyone will have to say I'm a #1 woman.

If anyone has any cool noob tips, I'd be glad to have em.
 

Considered HARMful

kiwifarms.net
Well, fewer than in the days when most pages were done by hand, I'd wager. If you're dealing with a professional site, they'll be running tests and such to notify them if their CMS or storefront or whatever is outputting bad HTML. But on the other hand, some XML parsers also have a "soup" mode which will have some tolerance for malformed XML, and pretty much all of them will gladly tell you where and how things go wrong if they just give up - so you can use regex, yes, to fudge the source into something more correct before loading it.
In the past ~10 years I've soured on the concept of "be liberal in what you accept, conservative in what you send". I strongly believe we would live in a vastly superior world if the browsers were allowed, nay - mandated! - by RFC to outright reject invalid input. I'm sad that the XHTML movement effectively went nowhere.
 

Kosher Dill

Potato Chips
True & Honest Fan
kiwifarms.net
Here's one I want to toss out to the audience - I was checking in on our perpetual laughingstock, the Covid simulator, and I happened across this diff:

In C++, what earthly reason is there to replace a "=default" constructor with a blank one that explicitly calls the superclass' default constructor? It seems weird that someone specifically added this in after a review - am I missing something?
 

Ledj

kiwifarms.net
In C++, what earthly reason is there to replace a "=default" constructor with a blank one that explicitly calls the superclass' default constructor? It seems weird that someone specifically added this in after a review - am I missing something?
I looked over the source code a bit; the change was likely for consistency of style across the constructors because semantically nothing has changed. The only scenario where that change actually accomplishes something is if the base type were trivial and (for whatever reason) you wanted to disable the derived type's trivial/aggregate trait.

It's kind of astonishing that they took the time to reexamine the constructors yet managed to make the absolute worst possible refactor. Vector2 should have mirrored Min/Max's =default instead, because Vector2 is needlessly disabling its trivial trait solely because of its default constructor.
 

emuemuemu

kiwifarms.net
Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.
If you want to go down that road, then you need a purpose-built library like nokogiri or beautiful soup. You can't just feed HTML into a plain XML parser; I've tried. It works sometimes, but not well enough for a robust scraper you want to keep running.

Using regular expressions to handle context-free languages is asking for trouble.
You're not trying to parse the HTML language though. It's just a long string you want to extract some specific values from. Regular expressions are exactly what you want.
The problem with fancy things like XPath and CSS selectors is that some trivial change or error in the page nowhere near the data you actually want will fuck up your whole scraper.
Again, based on my experience, curl + grep is all you need for an effective scraper that won't keep breaking.
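
A toy version of what I mean (the URL and the pattern are made up, but this is the whole idea):

Python:
# curl + grep, Python edition: fetch the page and treat it as one big string
import re
from urllib.request import urlopen

html = urlopen('https://example.com').read().decode('utf-8', errors='replace')

# yank out exactly the values you care about; the rest of the page can be as broken as it likes
prices = re.findall(r'data-price="([\d.]+)"', html)
print(prices)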
 

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
You're not trying to parse the HTML language though. It's just a long string you want to extract some specific values from. Regular expressions are exactly what you want.
The problem with fancy things like XPath and CSS selectors is that some trivial change or error in the page nowhere near the data you actually want will fuck up your whole scraper.
Again, based on my experience, curl + grep is all you need for an effective scraper that won't keep breaking.
I'd say that a regex might have the same brittleness problems if the page structure changes, but you're right in that regex alone might be good enough depending on what you're trying to scrape.

I remember hearing from a colleague about an unscrapable site… the creators of the site were so intent that their data not be scraped that they did some really crazy things to avoid it, like having ids and class names be random strings that changed with every page load and having every element be wrapped in a number of meaningless divs that varied on each page load… If you're facing chaos like that, using regex instead of XPath is gonna be your only possible approach.
 
  • Agree
Reactions: Kosher Dill

Marvin

Christorical Figure
True & Honest Fan
kiwifarms.net
XHTML wasn't an actual superset of HTML, nor vice versa, if I remember correctly. Unless that's changed sometime recently?
 
  • Agree
Reactions: Dandelion Eyes

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
XHTML wasn't an actual superset of HTML, nor vice versa, if I remember correctly. Unless that's changed sometime recently?
XML and HTML have common roots but slightly different rules (if you could say early HTML had any rules at all). XHTML is an effort to write HTML that conforms strictly to the rules of XML, so that anything with an XML parser can parse your pages too. The big difference is that singleton tags such as <br> are valid like that in HTML but have to be self-closed in XHTML, e.g. <br /> - which is how I instinctively write HTML at this point anyway. So no, HTML is not strictly a subset of XHTML.
 

Considered HARMful

kiwifarms.net
XML and HTML have common roots but slightly different rules (if you could say early HTML had any rules at all). XHTML is an effort to write HTML that conforms strictly to the rules of XML, so that anything with an XML parser can parse your pages too. The big difference is that singleton tags such as <br> are valid like that in HTML but have to be self-closed in XHTML, e.g. <br /> - which is how I instinctively write HTML at this point anyway. So no, HTML is not strictly a subset of XHTML.
Besides, HTML allows for stuff such as not needing to close particular block tags, for example <p>. Whenever you start a new paragraph with a <p>, the previous one is implicitly closed.
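
You can watch a lenient parser do it - a quick sketch with Python's lxml.html, purely for illustration:

Python:
from lxml import html

# no closing </p> anywhere in the input
doc = html.document_fromstring("<p>first paragraph<p>second paragraph")

# prints each paragraph separately - the first <p> was closed implicitly when the second opened
for p in doc.xpath('//p'):
    print(p.text)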
 

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
Besides, HTML allows for stuff such as not needing to close particular block tags, for example <p>. Whenever you start a new paragraph with a <p>, the previous one is implicitly closed.
That didn't sound right to me, at least for HTML5, so I looked it up - and according to the venerable developer.mozilla.org, you're right.

If you're on my team and I catch you writing HTML like that, I'm calling you into my office, though.
 

HolocaustDenier

MEEEEEEEE REEEEEEEEEEEEEE
kiwifarms.net
If anyone has any cool noob tips, I'd be glad to have em.
i made this simple script with the beautifulsoup 4 html parser a few years ago; it scrapes the "most popular searches" names from mercadolibre and saves them as a csv file with today's date as the filename...
we want to scrape every <li class="searches__item"> name inside the big box <andes-card searches>; page_soup.findAll() does that...
you can make sure the number of items makes sense using len
Python:
from datetime import date
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# fetch the trends page
my_url = 'https://tendencias.mercadolibre.cl'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# parse it and grab every "most popular searches" list item
page_soup = soup(page_html, "html.parser")
masusadas = page_soup.findAll("li", {"class": "searches__item"})

# write the search terms to a csv file named after today's date
today = str(date.today())
filename = today + ".csv"
f = open(filename, "w")
for item in masusadas:
    listado = item.text
    f.write(listado + " ,")
f.close()


not the most useful or original scraper idea but it helped me understand beautifulsoup; you can adapt it to scrape w/e you want in the format you want, just change my_url and the variable masusadas = page_soup.findAll("li",{"class":"searches__item"}) to w/e you want to scrape. can't upload the .py/.ipynb files here :/
 


SickNastyBastard

The Alpha Troon: Order 66 Pisstroopers
True & Honest Fan
kiwifarms.net
i made this simple script with the beautifulsoup 4 html parser a few years ago; it scrapes the "most popular searches" names from mercadolibre and saves them as a csv file with today's date as the filename...
we want to scrape every <li class="searches__item"> name inside the big box <andes-card searches>; page_soup.findAll() does that...
you can make sure the number of items makes sense using len
Python:
from datetime import date
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# fetch the trends page
my_url = 'https://tendencias.mercadolibre.cl'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# parse it and grab every "most popular searches" list item
page_soup = soup(page_html, "html.parser")
masusadas = page_soup.findAll("li", {"class": "searches__item"})

# write the search terms to a csv file named after today's date
today = str(date.today())
filename = today + ".csv"
f = open(filename, "w")
for item in masusadas:
    listado = item.text
    f.write(listado + " ,")
f.close()
not the most useful or original scraper idea but it helped me understand beautifulsoup; you can adapt it to scrape w/e you want in the format you want, just change my_url and the variable masusadas = page_soup.findAll("li",{"class":"searches__item"}) to w/e you want to scrape. can't upload the .py/.ipynb files here :/
Thanks bro, teaching myself this stuff has been a real uphill battle for a faggot retard like me. Thank you for your help.
 
  • Feels
Reactions: HolocaustDenier