Plea for advice on programming a 4chan archiver

Diogenes

kiwifarms.net
Hello, I just made an account to ask for some advice, as the program I want to make could be interesting for this website. It's a program that archives all the images and text from 4chan threads and saves them. I'm going to describe it below, and pray that someone either gives me a link to learn what I'm missing or spoonfeeds me:
1/ The program checks every thread on any number of specified boards once per tick, where a tick might range from a few seconds to around 10 minutes.
2/ The moment the program spots a keyword the user wants to find, which might be a person's name or whatever you're interested in collecting, the thread is monitored until it dies and then saved. This means all text and full-size images are stored.
3/ The program can be turned on and off whenever the user needs, and the saved threads are stored in order and sorted by board.
As of now, my only programming experience is mathematics-related one way or another (see Project Euler); I've never made a web crawler before, and the only programming language I'm half decent at is Java. I'm asking some programming savant to show me the first step on my quest to make this program, which will be released on GitHub for everybody to play with.
I'm also concerned about how slow it might be and what limits it might hit when searching for keywords, as 4chan holds a lot of threads and material at any given moment. I know this goes against the perishable philosophy of the website, but I want to archive some parts of it. I'm sorry for the autism this post shows; I hope someone helps me so I can make something worthwhile for this community.
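To make the idea a bit more concrete, here's my rough first stab at the polling step, cobbled together from the java.net.http docs and what I can see of 4chan's read-only JSON API at a.4cdn.org. I still need to check the API rules and rate limits, so treat it as a sketch, not working code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Very rough sketch of step 1/: poll the catalog of a few boards every tick and
// flag any catalog that currently mentions a keyword. Assumes 4chan's read-only
// JSON API at https://a.4cdn.org/{board}/catalog.json -- check the official API
// rules (rate limits, If-Modified-Since) before relying on this.
public class CatalogWatcher {
    public static void main(String[] args) throws Exception {
        List<String> boards = List.of("x", "trv", "out"); // boards to watch
        String keyword = "lucid dream";                   // keyword to look for
        long tickMillis = 60_000;                         // one minute between passes

        HttpClient client = HttpClient.newHttpClient();
        while (true) {
            for (String board : boards) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://a.4cdn.org/" + board + "/catalog.json"))
                        .build();
                String json = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

                // A real version would parse the JSON (Gson/Jackson) and check the
                // "sub"/"com" fields per thread; a dumb substring test is enough
                // to prove the polling loop works.
                if (json.toLowerCase().contains(keyword.toLowerCase())) {
                    System.out.println("/" + board + "/ catalog mentions \"" + keyword + "\"");
                }
                Thread.sleep(1_000); // be polite between boards
            }
            Thread.sleep(tickMillis);
        }
    }
}
```

A real version would parse the catalog properly and remember which threads it has already flagged instead of just printing a line, but that's the shape of it.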
 

Diogenes

kiwifarms.net
There's things for this already. Get out of here.
Most threads aren't archived quickly enough, because the 4chan archivers hosted on the internet need a user to manually submit the thread's URL. What I want is different because it makes sure that all content containing a particular keyword is stored. The reason it wouldn't store every thread is that a website with a userbase like 4chan's would fill up TBs of storage in a matter of hours.
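To be concrete about the "all content is stored" part: for a single thread that matched a keyword, saving its text plus every full-size image might look roughly like this, assuming the thread endpoint a.4cdn.org/{board}/thread/{no}.json and the i.4cdn.org image host work the way the API docs describe, and with Gson on the classpath for the JSON parsing:

```java
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "store everything" half: given one thread that matched a
// keyword, pull its JSON and save every full-size image. Assumes the thread
// endpoint https://a.4cdn.org/{board}/thread/{no}.json and the image host
// https://i.4cdn.org/{board}/{tim}{ext}.
public class ThreadSaver {
    public static void main(String[] args) throws Exception {
        String board = "x";
        long threadNo = 123456789L; // placeholder; would come from the catalog scan
        Path outDir = Path.of("archive", board, Long.toString(threadNo));
        Files.createDirectories(outDir);

        HttpClient client = HttpClient.newHttpClient();
        String url = "https://a.4cdn.org/" + board + "/thread/" + threadNo + ".json";
        String json = client.send(
                HttpRequest.newBuilder().uri(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Files.writeString(outDir.resolve("thread.json"), json); // keep the text too

        JsonArray posts = JsonParser.parseString(json)
                .getAsJsonObject().getAsJsonArray("posts");
        for (var element : posts) {
            JsonObject post = element.getAsJsonObject();
            if (!post.has("tim")) continue; // post has no image
            String name = post.get("tim").getAsLong() + post.get("ext").getAsString();
            HttpRequest img = HttpRequest.newBuilder()
                    .uri(URI.create("https://i.4cdn.org/" + board + "/" + name))
                    .build();
            client.send(img, HttpResponse.BodyHandlers.ofFile(outDir.resolve(name)));
            Thread.sleep(1_000); // don't hammer the image host
        }
    }
}
```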
 

Diogenes

kiwifarms.net
Why would anyone want to archive shitposting? There are also already scrapers which do the same thing. However, your ambition to archive is admirable. It's a lesson many kiwis should learn.
Some stories, images and info are lost. If you look up a particular keyword, you can filter out most of the shitposting and get straight to whatever you're interested in finding. Although most of the regular content is cancer, there are some boards with good content such as /trv/ and /out/, and boards such as /x/ are fun to read if you filter out the succubus shit and the generals.
Anyway, if someone here knows of such a scraper that I can download, and it's open source so I can modify it for my needs, that would make everything much easier. I've looked for a while and I don't know which one to go with; I'm not a good finder.
 

Diogenes

kiwifarms.net
These things already exist, but make one that goes to all boards; the current 4chan archives only cover a handful.
Archiving all of 4chan is impossible unless you have hundreds of TB to spare, and even that would only last you a week or so. Another problem is reposted content such as YLYL and copypasta/bait threads, which get the most attention and are pure shitposting. Bots make a huge number of posts every hour, and all that info would be useless to archive. Maybe incorporating a neural network, or an algorithm like /r9k/'s that filters out unoriginality, would make archiving workable, as most of the redundant shit would be thrown away.
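/r9k/'s actual robot isn't public, so I can only guess at how it works, but even a crude stand-in that hashes normalized post text and skips anything it has already seen would catch the exact copypasta reposts:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.Set;

// Crude stand-in for an /r9k/-style originality filter: normalize a post's
// text, hash it, and skip anything whose hash has been seen before. This only
// catches exact and near-exact reposts, nothing smarter than that.
public class RepostFilter {
    private final Set<String> seen = new HashSet<>();

    public boolean isOriginal(String postText) throws Exception {
        String normalized = postText.toLowerCase()
                .replaceAll("https?://\\S+", "") // drop links
                .replaceAll("[^a-z0-9 ]", "")    // drop punctuation/emoji
                .replaceAll("\\s+", " ")         // collapse whitespace
                .trim();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
        return seen.add(HexFormat.of().formatHex(digest)); // false if a repost
    }
}
```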
 

Cthulu

Satan's little helper
True & Honest Fan
kiwifarms.net
I still have no clue as to why you would want to archive 4chan, especially the boards you mentioned. Doesn't seem very funny tbh. However, if you PM the staff I'm sure they would help. Include them all.
 

Diogenes

kiwifarms.net
I still have no clue as to why you would want to archive 4chan, especially the boards you mentioned. Doesn't seem very funny tbh
It seems useful: maybe someone could search every time their city's name has appeared in a 4chan post, or if someone is stalking a person, searching for that person's name might be useful. Also, maybe you like a particular topic very much, such as lucid dreaming, and would like to read everything on that topic across the whole website. I see too many possibilities; I know it would be a very useful tool for me, at least while 4chan keeps struggling to stay alive.
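The search side could stay dumb for a long time, e.g. just walking the thread.json files the archiver saves (using the archive/{board}/{thread} layout from my earlier sketch) and grepping them for a term:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Toy example of the "search everything later" use case: walk the archive
// directory written by the saver sketch above and print every saved thread
// whose JSON mentions a term. Assumes the archive/{board}/{thread}/thread.json
// layout from the earlier sketch.
public class ArchiveSearch {
    public static void main(String[] args) throws Exception {
        String term = "lucid dreaming"; // or a city name, a person's name, etc.
        try (Stream<Path> files = Files.walk(Path.of("archive"))) {
            files.filter(p -> p.getFileName().toString().equals("thread.json"))
                 .filter(p -> {
                     try {
                         return Files.readString(p).toLowerCase()
                                     .contains(term.toLowerCase());
                     } catch (Exception e) {
                         return false; // unreadable file, skip it
                     }
                 })
                 .forEach(p -> System.out.println("match: " + p.getParent()));
        }
    }
}
```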
 

Hell0

i cut my hair. i run downstairs.
kiwifarms.net
Archiving all of 4chan is impossible unless you have hundreds of TB to spare, and even that would only last you a week or so. Another problem is reposted content such as YLYL and copypasta/bait threads, which get the most attention and are pure shitposting. Bots make a huge number of posts every hour, and all that info would be useless to archive. Maybe incorporating a neural network, or an algorithm like /r9k/'s that filters out unoriginality, would make archiving workable, as most of the redundant shit would be thrown away.

eh, fair enough dude good luck
 