How to archive a complete Facebook timeline when archive.md fails?

CrusaderKangs

kiwifarms.net
I'm trying to archive an undiscovered lolcow with 10,000+ Facebook timeline posts and images, 99% of which are on-topic. The cow's profile is visible to any logged-in Facebook account, but archive.md only gets a "404 not found" page. Simple solutions like full-page screenshots and "Save Page As" also fail, since Facebook only loads timeline posts as far as you've scrolled.

What's the best website/tool for this job? Given the number of posts, any solution will probably have to save each post separately. I'd prefer HTML archives over screenshots if at all possible, for curation purposes: they preserve links and make copying text easier.
 

UN 474

Guest
kiwifarms.net
I'm not sure if there's a tool specifically for downloading Facebook timelines. You could write your own simple Python program to crawl through the Facebook page, or use the HTTrack website-archiving tool. The way Facebook loads timelines is a little weird, though, so I'm not sure HTTrack would work. It's worth a try.
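
If you go the Python route, here's a rough sketch of the idea. The profile URL and the cookie values are placeholders you'd fill in from a logged-in browser; the catch is that Facebook loads the timeline with JS as you scroll, so a plain HTTP request will probably only get you the first batch of posts, if that.

Code:
#!/usr/bin/env python3
# Rough sketch of a DIY crawler. Copy the c_user / xs cookies from a
# logged-in browser session; without them Facebook serves a login wall.
import requests
from bs4 import BeautifulSoup

PROFILE_URL = 'https://www.facebook.com/some.profile'                # placeholder
COOKIES = {'c_user': 'YOUR_USER_ID', 'xs': 'YOUR_SESSION_TOKEN'}     # placeholders

resp = requests.get(PROFILE_URL, cookies=COOKIES,
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# grab anything that looks like a post permalink
links = {a['href'] for a in soup.find_all('a', href=True)
         if '/posts/' in a['href'] or 'story_fbid=' in a['href']}
for link in sorted(links):
    print(link)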

Secondly, I'd recommend making a thread in the Proving Grounds with all the amusing information on this "lolcow". I'd gladly help dig up some information.
 

BlancoMailo

True & Honest Fan
kiwifarms.net
K, current plan is to get a list of URLs to every timeline post with some JS, then feed those into some extension that screenshots them.

And I will definitely post to Proving Grounds shortly, but I want a preliminary archive first. This cow is a niche sort of crazy, and he may get spooked and mass-delete.

Be sure to record your process when you get a working method down; it can help to create a standardized procedure for these situations in the future.
 

CrusaderKangs

kiwifarms.net
Be sure to record your process when you get a working method down; it can help to create a standardized procedure for these situations in the future.

Halfway there. Here's the userscript I made to get post URLs. Install it in Tampermonkey, since Greasemonkey sucks nowadays. Control it via three userscript menu items: Start, Stop, and Copy (for intermediate results, in case the tab crashes). It jumps to Y=0, then to Y=bottom, then waits 1s for FB to load, then copies all the post URLs and deletes their elements, then repeats. It stashes its data at "window.pwn" if you want to mess around with it. Firefox only: Chrome (Google's spyware) crashes after ~2 GB of RAM (~1,500 posts). Easily broken by FB code changes: every time FB changes the CSS class for LINK_CLASS, you get Zucked.

Code:
// ==UserScript==
// @name         Timeline Cat
// @namespace    http://example.com/
// @version      0.1
// @description  Get all posts.
// @author       You
// @match        https://www.facebook.com/*
// @grant        unsafeWindow
// @grant        GM_registerMenuCommand
// @grant        GM_setClipboard
// ==/UserScript==

let getLinks = () => {
    // get page-specific container:
    let ca = unsafeWindow.document.getElementById('contentArea')
    // get container of posts:
    let sc = ca.querySelector('#timeline_story_column > [id^=timeline_story_container]')
    // prune 2 useless children:
    try {
        sc.querySelector(':scope > #recent_optimistic_video').remove()
        sc.querySelector(':scope > #timeline_section_stories_pagelet_container').remove()
    } catch (_) { }

    let linkFromPost = post => {
        // _5pcq is FB's CSS class for a post's permalink anchor; it changes whenever FB ships new frontend code
        let LINK_CLASS = '_5pcq'
        try { return post.querySelector('a.' + LINK_CLASS).href }
        catch (_) { return 'http://example.com/error/missing' }
    }
    let harvestPosts = container => {
        let childs = Array.from(container.children)
        let links = childs.map(linkFromPost)
        childs.forEach(child => child.remove())
        return links
    }
    if (window.pwn === undefined) {
        window.pwn = { urls: [], pollInt: null, pollPeriod: 1000 }
        unsafeWindow.pwn = window.pwn
    }
    let pwn = window.pwn
    // poll: harvest whatever has loaded, then bounce top -> bottom so FB loads the next batch
    let poll = () => {
        let links = harvestPosts(sc)
        pwn.urls = pwn.urls.concat(links)
        unsafeWindow.scrollTo(0, 0)
        unsafeWindow.scrollTo(0, unsafeWindow.document.body.scrollHeight)
    }
    if (!Number.isInteger(pwn.pollInt)) {
        pwn.pollInt = setInterval(poll, pwn.pollPeriod)
    }
}

let stopLinks = () => {
    try {
        clearInterval(window.pwn.pollInt)
        window.pwn.pollInt = null
        console.log(window.pwn.urls)
    } catch (_) { }
}

let copyLinks = () => {
    GM_setClipboard(JSON.stringify(window.pwn.urls, null, 2))
}

// strip page chrome that identifies the logged-in account (blue bar, side boxes, right column, chat dock)
let deleteRefs = () => {
    try { unsafeWindow.document.querySelector('#pagelet_bluebar').remove() } catch (_) { }
    try {
        Array.from(unsafeWindow.document.querySelectorAll('div._43u6')).forEach(c => c.remove())
        Array.from(unsafeWindow.document.querySelectorAll('div._4efl')).forEach(c => c.remove())
    } catch (_) { }
    try {
        Array.from(unsafeWindow.document.querySelector('#rightCol').children).forEach(c => c.remove())
    } catch (_) { }
    try { unsafeWindow.document.querySelector('#pagelet_dock').remove() } catch (_) { }
}

GM_registerMenuCommand('Get links', getLinks, 'g')
GM_registerMenuCommand('Stop links', stopLinks, 's')
GM_registerMenuCommand('Copy links', copyLinks, 'c')
GM_registerMenuCommand('Delete refs (care, does not always work)', deleteRefs, 'd')
 

CrusaderKangs

kiwifarms.net
The proper way to solve the second part is to write a browser extension that takes a list of URLs and iteratively opens each one, prunes the trailing identifying divs from the comment box (currently "div._3w53"), then uses html2canvas to screenshot "#contentArea".

Or alternatively, load each URL and save the whole page contents via a shim account.
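
If writing an extension turns out to be a pain, a headless-browser script can do roughly the same thing. This is just a sketch of that alternative (not something I've built), using Python + Playwright; it assumes you've saved a shim account's login state to "fb_state.json" beforehand (e.g. with `playwright codegen --save-storage=fb_state.json https://www.facebook.com/`) and that "pwn.json" holds the list of post URLs from the userscript:

Code:
#!/usr/bin/env python3
# Sketch: open each post URL as a logged-in shim account and save both
# the raw HTML and a screenshot of #contentArea.
import json
from pathlib import Path
from playwright.sync_api import sync_playwright

urls = json.loads(Path('pwn.json').read_text())
out = Path('archive')
out.mkdir(exist_ok=True)

with sync_playwright() as p:
    browser = p.firefox.launch()
    context = browser.new_context(storage_state='fb_state.json')
    page = context.new_page()
    for i, url in enumerate(urls):
        page.goto(url)
        page.wait_for_selector('#contentArea')
        # prune the identifying comment-box divs before capturing
        page.evaluate("document.querySelectorAll('div._3w53').forEach(d => d.remove())")
        (out / ('%05d.html' % i)).write_text(page.content())
        page.locator('#contentArea').screenshot(path=str(out / ('%05d.png' % i)))
    browser.close()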
 
U

UN 474

Guest
kiwifarms.net
then waits 1s for FB to load

Good work! I'm not a JavaScript programmer, so I'm not sure if it supports coroutines.
I would use coroutines to make absolutely sure that section has loaded. Something may go wrong while looping through all those posts and fuck up the entire process.
 

CrusaderKangs

kiwifarms.net
Good work! I'm not a JavaScript programmer, so I'm not sure if it supports coroutines.
I would use coroutines to make absolutely sure that section has loaded. Something may go wrong while looping through all those posts and fuck up the entire process.

JS has coroutines, but they're a PITA in this context: you'd have to dig into FB's obfuscated code to find the events to await. There's no need, either. If nothing new has loaded yet, I just wait another second; "container.children" coming back empty costs nothing.
 

CrusaderKangs

kiwifarms.net
It turned out that the cow's posts were public, even though his timeline was private, so I piped them into archive.md's API with the script below. It reads a JSON array of URLs from "pwn.json", maps archiveis.capture over them while printing progress, and saves a JSON array of [original URL, archive URL] pairs to "pwn.pairs.json".

Code:
#!/usr/bin/env python3

import archiveis
import json
import time

with open('pwn.json') as fd:
    urls = json.load(fd)

skip_to = 0 # bump this to resume partway through after a crash or ban
pairs = []
try:
    for (i, url) in enumerate(urls):
        if skip_to > 0:
            skip_to -= 1
            continue
        print(i)
        arch = archiveis.capture(url)
        pairs.append([url, arch])
        print(pairs[-1])
        time.sleep(1.0) # REQUIRED to avoid getting IP-banned from the API for DDOS
except Exception as e:
    # most likely cause: archive.md throttling
    print('Stopped at index %d (%s). Increase your sleep above 1.0 or you may get IP-banned.' % (i, e))

with open('pwn.pairs.json', 'w') as fd:
    json.dump(pairs, fd, indent=2)

Note that archive.md doesn't get comments for some reason, serves FB in the Tatar language, and keeps an annoying popup asking you to join FB. I recommend blocking the latter with this cosmetic filter:

Code:
archive.fo##div:nth-of-type(3) > div:nth-of-type(1) > div > div:nth-of-type(2) > div:nth-of-type(2) > div:nth-of-type(1) > div > div:nth-of-type(1)

Unfortunately after about 1,000 archives, archive.md got Zucked, and FB now presents it with "Are you a robot?" instead of any content.
 

UN 474

Guest
kiwifarms.net
Unfortunately after about 1,000 archives, archive.md got Zucked, and FB now presents it with "Are you a robot?" instead of any content.

Have you thought about adding slight randomization to the timings? It might trick Facebook's servers into thinking it's a normal person. God, I absolutely hate Facebook.
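
Something like this drop-in change to the sleep would probably be enough (just a sketch; the 1-3 second range is a guess):

Code:
import random
import time

# instead of a fixed 1-second pause, sleep a random 1-3 seconds between captures
# so the request timing looks less mechanical
time.sleep(random.uniform(1.0, 3.0))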
 
