Programming thread

melty

True & Honest Fan
kiwifarms.net
I also want to make webscrapers. The use cases are scraping all product pages from a website, or scraping pages that match search keywords. What do people think about Scrapy? I don't know anything except Python, which I don't really know either. But I'm trying to learn.

If anyone needs a simple scraper without programming, there's a Chrome add-on called Instant Data Scraper that works pretty well in a lot of cases. It works best when you have a series of pages with a consistent "next" button, and it dumps the data into a .csv.
 
  • Like
Reactions: 3119967d0c

3119967d0c

"a brain" - @REGENDarySumanai
True & Honest Fan
kiwifarms.net
I also want to make webscrapers. The use cases are scraping all product pages from a website, or scraping pages that match search keywords. What do people think about Scrapy? I don't know anything except Python, which I don't really know either. But I'm trying to learn.

If anyone needs a simple scraper without programming, there's a Chrome add-on called Instant Data Scraper that works pretty well in a lot of cases. It works best when you have a series of pages with a consistent "next" button, and it dumps the data into a .csv.
Depends on exactly what you want to do, and the complexity of the sites involved.

You might find it profitable to look into Selenium, which allows automation of most modern browsers. Very valuable when dealing with sites that have more sophisticated JS involved. There's a basic extension for Firefox or Chrome that has a native recording/scripting function. That basic extension is really just set up for testing websites: checking for the presence of specific elements by XPath, checking values within them, etc. But you can export the scripts it generates for browsing to particular URLs, clicking on elements, and so on, to Python and other languages. And if you do that, you can mess around and interactively test the stuff you're trying to extract from the DOM at the Python prompt. Invaluable.
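
Something like this is the basic shape once you export to Python (rough untested sketch; assumes the selenium package is installed and a driver like geckodriver is on your PATH, and the URL/XPath here are made up):
Python:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Fire up a real browser so JS-heavy pages render before we scrape
driver = webdriver.Firefox()
try:
    driver.get("https://example.com/products?page=1")
    # The same kind of XPath the recorder extension spits out;
    # tweak it interactively at the Python prompt until it matches
    for el in driver.find_elements(By.XPATH, "//div[@class='product']/h2"):
        print(el.text)
finally:
    driver.quit()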
 

SickNastyBastard

Gaslight > Greenlight > Glow Bright
True & Honest Fan
kiwifarms.net
Aight my niggas, this happened 2 day:
fagernetes.PNG


That bricks my cluster manager and my containerization strategy; I'm not using tech from shitbag bbc aficionados.

I'm leaning towards Apache Mesos since I'm using Apache Pulsar as the broker. I figured it'd be an easier integration, and it's highly available and fault tolerant. I looked at Docker and Docker Compose, but they weren't fault tolerant, and Docker looked to handle a node failure by just spinning up another one. I don't like how mapping is done with Marathon on Mesos. The other alternatives I found either didn't run on my setup or didn't fit the needs I chose k8s for in the first place. And I can't find much on how well it would work with my RDBMS. I can't do a NoSQL thing, since I'm not going through the work I'd need to establish the analytics I need with a non-relational DB.

what kind of shit is getting brokered:
Flink/FlinkML for stream handling with a BERT model
ZooKeeper/BookKeeper
Hashcat Cracker Program
Gateway
SQL
TensorFlow/BERT Trainer.
 

cecograph

kiwifarms.net
There's a UK comic called Stewart Lee, a card-carrying leftie, who once did this, and it was funny because, at the time, the concept was ridiculous:


Nowadays, it's completely normal for abstract entities such as your container-orchestration system to be opposed to racism.

SickNastyBastard

Gaslight > Greenlight > Glow Bright
True & Honest Fan
kiwifarms.net
There's a UK comic called Stewart Lee, a card-carrying leftie, who once did this, and it was funny because, at the time, the concept was ridiculous:


Nowadays, it's completely normal for abstract entities such as your container-orchestration system to be opposed to racism.
I gotta admit, that's golden and funny.

UPDATE: Gotta go with Kubernetes, unfortunately. My clusters are going to pay homage to klan leaders. Mesos is way too complicated and the use of Marathon is retarded; Nomad just isn't what we need, and Docker Swarm/Docker Compose aren't fault tolerant.

Good news though! I can get rid of Pulsar, ZooKeeper, and BookKeeper and handle what I need to with a Redis cache and solid priority queues. That just made my project less complicated by leagues.
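
For anyone curious, a Redis sorted set is basically a priority queue already. Rough sketch with redis-py (assumes a local server; the key and task names are made up):
Python:
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(task_id, priority):
    # Lower score = higher priority; ZADD keeps members sorted by score
    r.zadd("task_queue", {task_id: priority})

def dequeue():
    # ZPOPMIN atomically pops the lowest-scoring (most urgent) member
    popped = r.zpopmin("task_queue")
    return popped[0][0].decode() if popped else None

enqueue("train_bert", 1.0)
enqueue("hashcat_job", 5.0)
print(dequeue())  # -> train_bert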
 

SickNastyBastard

Gaslight > Greenlight > Glow Bright
True & Honest Fan
kiwifarms.net
No, it's just that normal people say "bricks" when they mean "rendered completely inoperable due to a software or firmware fault", not "my political views prevent me from using this software".
It's not my political views; it's the fact that they're openly supporting a money laundering organization that's linked to rioting and burning shit down in my country. I've worked as a dev, so I'm aware of the emasculated faggotry; this is just a whole new level. /my fag opinion

Anyone got good suggestions on structuring containers/VMs for dev work? I typically just don't like using them, as a personal choice, since I've found them slow in the past. With the Redis cache going, I want to be performant but not use up all my computer's resources, so a ballpark figure from anyone who has built up a biggish cache would be great.
 

Marvin

Christorical Figure
True & Honest Fan
kiwifarms.net
Depends on exactly what you want to do, and the complexity of the sites involved.

You might find it profitable to look into Selenium, which allows automation of most modern browsers. Very valuable when dealing with sites that have more sophisticated JS involved. There's a basic extension for Firefox or Chrome that has a native recording/scripting function. That basic extension is really just set up for testing websites: checking for the presence of specific elements by XPath, checking values within them, etc. But you can export the scripts it generates for browsing to particular URLs, clicking on elements, and so on, to Python and other languages. And if you do that, you can mess around and interactively test the stuff you're trying to extract from the DOM at the Python prompt. Invaluable.
Selenium is an important tool to have in your toolbox if you're doing scraping stuff, because, as you note, dipshits are creating websites that consist of <html><head><script src="my_site_lol.js"></script></head><body></body></html> and leaving it at that.

But it's probably something to keep as a backup and use only if one of the simpler existing scraping libraries (like BeautifulSoup) doesn't work to begin with.
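
For a plain static page, that path is only a few lines. Rough sketch with requests + BeautifulSoup (the URL and selector are made up):
Python:
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML; no browser engine needed for static pages
resp = requests.get("https://example.com/products", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# CSS selector for whatever bits you actually care about
for item in soup.select("div.product h2"):
    print(item.get_text(strip=True))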

Sorta related, but a while ago I wrote a small script using phantomjs (same idea as selenium: a headless browser engine you drive from a scripting language) to render tweets. Though alas, phantomjs development got suspended a few years ago.
 

SIGSEGV

Segmentation fault (core dumped)
True & Honest Fan
kiwifarms.net
I'm just going to shill RapidJSON here, because it's an extremely solid C++ library that's served me very well.
Pros:
  • It's fast as fuck
  • It doesn't use up a lot of memory
  • As long as you avoid obvious data races, you can safely use it in multithreaded code
  • It supports in place parsing
  • It doesn't have any dependencies outside of the C++ standard library
  • It's compatible with C++11 onwards
  • You can #define certain macros before you #include the headers to enable some quality of life features like support for std::string
Cons:
  • It uses PascalCase instead of snake_case, which might take a bit of getting used to
  • If you need to access the target Document after the original Document object goes out of scope, copying/moving strings from one Document to another doesn't work properly unless you use SetString (see below for a function I wrote to make this easier)
Things that are both pros and cons:
  • It's a header only library, which is great for portability but also means that everything is implicitly inline
  • NDEBUG must not be defined during testing, because RapidJSON uses assertions to check whether you fucked up instead of throwing exceptions. If you fuck up (by, e.g., trying to access a field that doesn't exist or getting the data type wrong) and NDEBUG is defined, you'll just get a segmentation fault. The lack of exceptions makes your code faster and smaller in release builds, however.
C++:
#include <cassert>
#include <string_view>
#include "rapidjson/document.h"

//dest_alloc should be obtained by calling GetAllocator() on the rapidjson::Document object which contains dest
void copy_string_field(rapidjson::Value& dest, const rapidjson::Value& src, rapidjson::Document::AllocatorType& dest_alloc){
    //if src is not JSON null
    if(!src.IsNull()){
        assert(src.IsString());
        //use GetStringLength() so an embedded null byte can't truncate the copy
        std::string_view data_str(src.GetString(), src.GetStringLength());
        assert(dest.IsObject());
        //SetString copies the bytes via dest_alloc, so dest stays valid after src's Document is destroyed
        dest.SetString(data_str.data(), data_str.length(), dest_alloc);
    } else{
        assert(dest.IsObject());
        dest.SetNull();
    }
}
 

Aidan

kiwifarms.net
I also want to make webscrapers. The use cases are scraping all product pages from a website, or scraping pages that match search keywords. What do people think about Scrapy? I don't know anything except Python, which I don't really know either. But I'm trying to learn.

If anyone needs a simple scraper without programming, there's a Chrome add-on called Instant Data Scraper that works pretty well in a lot of cases. It works best when you have a series of pages with a consistent "next" button, and it dumps the data into a .csv.
Not exactly programming, and programming web scrapers is out of my league atm, but would a good old-fashioned wget script be out of the question for your needs?
 
  • Informative
Reactions: melty

Least Concern

Pretend I have a waifu avatar like everyone else
kiwifarms.net
Not exactly programming, and programming web scrapers is out of my league atm, but would a good old-fashioned wget script be out of the question for your needs?
Web scraping involves extracting data from a web page. wget will let you download the HTML file, but doesn't do anything that would help you with extracting data from it. You still need to parse that HTML in some way or other. In easy cases you can load that HTML into an XML parser and run some XPath queries against it, or maybe even get away with a regex or two. In more difficult cases, this might involve using a web engine to actually render the page and do things like execute JavaScript to add things to the DOM which might not be in the HTML itself. wget won't help you much for either of those tasks.
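
The easy case can look something like this. Rough sketch with requests + lxml (the URL and XPath are made up):
Python:
import requests
from lxml import html

# Download the HTML (this is all wget would have done for you)
page = requests.get("https://example.com/products", timeout=10)
tree = html.fromstring(page.text)

# The extraction step wget can't do: an XPath query against the parsed document
print(tree.xpath("//span[@class='price']/text()"))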
 
  • Agree
Reactions: Marvin

Kosher Salt

(((NaCl)))
kiwifarms.net
In more difficult cases, this might involve using a web engine to actually render the page and do things like execute JavaScript to add things to the DOM which might not be in the HTML itself. wget won't help you much for either of those tasks.
If it's AJAX-based, it may be relatively simple to skip the rest of the page altogether and just grab the XML or JSON resource that has the data you need.

You can use your browser's network monitor to see if there are any XHR resources being loaded and what their contents are. If that's where the data is, and you can hit the right URL directly (without any complicated process like logging in first), it's probably going to be easiest to just fetch that and parse it.
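
If you do find such a resource, the whole scraper can collapse down to something like this. Rough sketch (the endpoint and field names are made up):
Python:
import requests

# The URL spotted under XHR in the browser's network monitor
resp = requests.get("https://example.com/api/products?page=1", timeout=10)
resp.raise_for_status()

# No HTML parsing at all: the payload is already structured JSON
for product in resp.json()["products"]:
    print(product["name"], product["price"])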
 

Shoggoth

kiwifarms.net
I'm just going to shill RapidJSON here, because it's an extremely solid C++ library that's served me very well.
Pros:
  • It's fast as fuck
  • It doesn't use up a lot of memory
  • As long as you avoid obvious data races, you can safely use it in multithreaded code
  • It supports in place parsing
  • It doesn't have any dependencies outside of the C++ standard library
  • It's compatible with C++11 onwards
  • You can #define certain macros before you #include the headers to enable some quality of life features like support for std::string
Cons:
  • It uses PascalCase instead of snake_case, which might take a bit of getting used to
  • If you need to access the target Document after the original Document object goes out of scope, copying/moving strings from one Document to another doesn't work properly unless you use SetString (see below for a function I wrote to make this easier)
Things that are both pros and cons:
  • It's a header only library, which is great for portability but also means that everything is implicitly inline
  • NDEBUG must not be defined during testing, because RapidJSON uses assertions to check whether you fucked up instead of throwing exceptions. If you fuck up (by, e.g., trying to access a field that doesn't exist or getting the data type wrong) and NDEBUG is defined, you'll just get a segmentation fault. The lack of exceptions makes your code faster and smaller in release builds, however.
C++:
#include <cassert>
#include <string_view>
#include "rapidjson/document.h"

//dest_alloc should be obtained by calling GetAllocator() on the rapidjson::Document object which contains dest
void copy_string_field(rapidjson::Value& dest, const rapidjson::Value& src, rapidjson::Document::AllocatorType& dest_alloc){
    //if src is not JSON null
    if(!src.IsNull()){
        assert(src.IsString());
        //use GetStringLength() so an embedded null byte can't truncate the copy
        std::string_view data_str(src.GetString(), src.GetStringLength());
        assert(dest.IsObject());
        //SetString copies the bytes via dest_alloc, so dest stays valid after src's Document is destroyed
        dest.SetString(data_str.data(), data_str.length(), dest_alloc);
    } else{
        assert(dest.IsObject());
        dest.SetNull();
    }
}
Thoughts on simdjson?
 
  • Like
  • Thunk-Provoking
Reactions: Tookie and SIGSEGV

SIGSEGV

Segmentation fault (core dumped)
True & Honest Fan
kiwifarms.net
Thoughts on simdjson?
This is the first time I've heard of it. It doesn't look like simdjson allows you to modify the parsed DOM, which is something that I need to do in the project where I'm using RapidJSON. The API that I'm pulling from via libcurl doesn't give me JSON that I can immediately add to our database. It has to be formatted in a very specific way, and there's also a bunch of extra data that we don't care about. There are also a couple of fields that use numeric IDs, and I have to convert those into the proper string. Several fields also need to be renamed in the output document because they're nested several layers deep in the API response, and the name isn't as descriptive/meaningful when you take it out of the nested context. RapidJSON lets me solve all of these issues and more.
 

RandomTwitterGuy

kiwifarms.net
So I've had to start learning a bit of Python for a project, mostly so I can read other people's code.

I mostly work on PLC programming, with structured text as my base and then some C++ follow-up for things that need doing. I can also do stuff in Java, but that was a long time ago. So learning Python is not hard in any way, but it kind of scares me.

It's way too free-form, and honestly it feels like I'm making up pseudo-code. I get why people like it; it's fast to write in. However, as I'm too used to the limitations of a PLC and C++, this new free-form do-whatever-you-want style scares me, and I fear my code will be a lot more "spaghetti" than normal if I were to use Python.
 

Spedestrian

Shitposting in Ring Zero
True & Honest Fan
kiwifarms.net
So I've had to start learning a bit of Python for a project, mostly so I can read other people's code.

I mostly work on PLC programming, with structured text as my base and then some C++ follow-up for things that need doing. I can also do stuff in Java, but that was a long time ago. So learning Python is not hard in any way, but it kind of scares me.

It's way too free-form, and honestly it feels like I'm making up pseudo-code. I get why people like it; it's fast to write in. However, as I'm too used to the limitations of a PLC and C++, this new free-form do-whatever-you-want style scares me, and I fear my code will be a lot more "spaghetti" than normal if I were to use Python.
Here are a few things I'd recommend.
  • Get an IDE like PyCharm or a linter for your text editor of choice, e.g. the Anaconda plugin for Sublime Text. That'll give you some guardrails so you don't go too crazy.
  • Get the IPython interpreter. It's got a shitload of useful features like session history, tab completion, and the ability to easily pull up documentation for any object by just adding a question mark to its name. It's good for making messy prototypes of features that you can clean up later once you've got all your pieces laid out.
  • Remember that exceptions are your friend. In C++ exceptions are expensive and you look before you leap, but in Python exceptions are cheap and it's easier to ask forgiveness than permission. Your code can and should contain a shitload of try...except blocks: they're actually meant to be used as flow control in Python (see the sketch after this list), and they provide some nice visual structure to make things feel less chaotic to boot.
  • If your old habits work then keep using them. Just because Python lets you play fast and loose doesn't mean that you have to. I still define most of my constants at the top of the file in Python because it's convenient and it makes sense to me. You can probably find a way to do most of the stuff that C++ requires in Python, and if that makes things easier for you then definitely do it. It'll make your life easier if you need to switch back to C++ too — you won't be relying on a bunch of freeform Python shit that doesn't fly in other languages.
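Here's that exceptions point as a concrete sketch (the config dict and keys are made up):
Python:
config = {"host": "localhost", "port": "8080"}

# LBYL, the C++ habit: check everything before you act
if "port" in config and config["port"].isdigit():
    port = int(config["port"])
else:
    port = 8080

# EAFP, the Python idiom: just try it and catch the failure
try:
    port = int(config["port"])
except (KeyError, ValueError):
    port = 8080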
And of course...
PythonNoScareOfSnek.jpg
 