Archival Tools - If you make a thread without archiving shit I will strangle you.

BlancoMailo

True & Honest Fan
kiwifarms.net
I want to archive someone's entire twitter. They have 41.6K tweets. Is there any feasible way to do this
Depends how good you are at working with bubblegum and shoestrings:
Recently I wanted to try archiving some furries' twitter timelines, as well as their followers and following lists. I went looking around online for tools that would help with this, but Google was interested only in showing me shady shit with price tags upwards of $40 per use. These tools also require you to have a Twitter account and access to Twitter's developer interface, which requires an application and has an approval process, and fuck that noise.

So, I went open source. And lo and behold, I got things to work properly. I'm happy to share with you a guide on how to go full NSA on somebody's public twitter feed, without even needing to be logged into a twitter account. This also will work on any operating system that can support Python, so basically any operating system not developed by a lolcow.

DISCLAIMER: I am not a programmer. This guide covers use of coding tools that I barely know how to use, let alone how to use safely. I am literally a script kiddie playing with scripts. Fuck around with these tools at your own risk.

We'll be covering how to use Twint, an open-source Twitter scraping tool coded in Python. The developers are working on a desktop app of this, but for now, you need to have Python installed in order to run it.

You can download Python here. If you're installing it on Windows, you'll want to make sure that these two options are checked during install:

View attachment 1455428 View attachment 1455433

The first, pip, lets you download and install Python shit with a single command. The second lets you run Python shit from the command line. Get Python installed, then open up a command line interface (cmd if you're neurotypical; PowerShell, bash, or god knows what else otherwise) and type in the following command:

Code:
pip3 install twint
If you see a series of messages relating to downloading and installing shit from the internet and System32 doesn't vanish, you've done it right. Wait for the gears to stop whirring, then use the command prompt to navigate to a folder you know how to find again, such as Downloads if you're a heathen who saves everything to Downloads. Then, refer to the documentation in Twint's github wiki to build a command to harvest the Twitter account of your choice. As a note, this doesn't work on protected Twitters, so you won't be able to trawl somebody's AD timeline. What a shame.

Here's some examples of commands you can use:

Code:
 twint -u khordkitty -o khord_tweets.csv --csv
This command will pull everything down from KhordKitty's Twitter timeline and save it to a comma-separated values (CSV) file. CSV files are wonderful contraptions that can be opened in Excel or another spreadsheet editor, where you can run all manner of analytics on them. See attached for a sample of the results!

Code:
 twint -u khordkitty --followers -o khord_followers.csv --csv
This command will rip somebody's follower list from start to finish, saving every follower's username as a nice list.

Code:
 twint -u khordkitty --followers --user-full -o khord_followers.csv --csv
This command will rip somebody's follower list from start to finish, saving every follower's name, username, bio, location, join date, and various other shit as a nice list. WARNING: It also takes a lot fuckin longer to work.

Code:
 twint -u khordkitty --following -o khord_following.csv --csv
Same as the command to rip a follower list, except that this time, it collects the list of everybody they're following instead so you can see how many AD twitters they're jerking off to. The --user-full argument works here too, with the same caveat about taking longer.

That's just a few of the wonderful things you can do with twint. However, as I have learned, twint is not without its limitations. One such limitation I have observed is that Twitter does not like being scraped and will stop responding to scrapes after 14,400 tweets in succession have been scraped. Once Twitter so lashes out at you, it'll take a few minutes before Twint starts working again.

Twint has some workarounds for this -- such as allowing you to use the --year argument to only pull tweets from before a given year -- but it's still annoying as hell. I'll have to experiment further with it. Also, as you may have guessed, this only grabs the text of tweets, including image URLs, so if you really want to go full archivist on some faggot you'll need to use some additional code. Thankfully, @Warecton565 is to the rescue, with some code in a wonderful post. The code is not perfect; I did have to make a bit of a change just to get it to work at all:

Python:
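import csv
import json
from sys import argv

import requests

# Usage: python <this script> tweets.csv output_folder  (the folder must already exist)
with open(argv[1], encoding="utf8", newline="") as f:
    for tweet in csv.DictReader(f):
        # The photos column is written as a list with single quotes,
        # so swap the quotes before handing it to the JSON parser.
        pics = json.loads(tweet["photos"].replace("'", '"'))
        for pic in pics:
            r = requests.get(pic)
            # Name each file after the part of the URL that follows /media/
            with open(argv[2] + "/" + pic.split("/media/")[1], "wb") as out:
                out.write(r.content)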
And I had to make the folder it was going to output to first and run the damn thing in IDLE before it would do anything. However, once you get it working, you can wind up with a folder containing hundreds of furry porn images and fursuit photos in mere minutes. Maybe even a photo of a dong or two.
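
For what it's worth, the quote-swapping trick in that script will fall over if a URL ever contains an apostrophe. Here's a minimal sketch of an alternative, assuming the photos column holds a Python-style list such as ['https://example.com/media/abc.jpg'] (that's inferred from the script above, not taken from Twint's docs). The helper names parse_photos and media_filename are made up for illustration:

```python
import ast


def parse_photos(cell):
    """Turn the text of a photos cell into a list of URL strings."""
    if not cell.strip():
        return []
    # literal_eval only evaluates literals (lists, strings, numbers),
    # so it reads "['a', 'b']" without running arbitrary code.
    value = ast.literal_eval(cell)
    return [str(url) for url in value]


def media_filename(url):
    """Use whatever follows /media/ in the URL as the file name."""
    return url.split("/media/", 1)[1]
```

These could stand in for the json.loads(...) line and the pic.split(...) expression respectively; the rest of the script would stay the same.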

Have fun! If you have anything you would like added to this post, please let me know via PM or some other means. Again, I'm just a fucking script kiddie.
 

Iris Hunter

Eternal Retart
kiwifarms.net
I was originally going to ask how to archive Instagram profiles to have a more permanent evidence for a personal situation I'm in, though I thought everyone here could benefit from this.

If I'm guessing correctly, you need an Instagram account to view Instagram itself, which is perhaps why web.archive and most archival sites fail in that regard.

I've seen archive.md/vn/md/today/whatever create placeholder accounts to bypass this kind of stuff, but that doesn't seem to work with Instagram; it looped until it failed on my end.

As I was writing this post however I learned that it works fine with Instagram posts, so as long as the link format is "https://www.instagram.com/p/insertposthere" and not "https://instagram.com/insertuserhere" it should work.

Still, I'd really appreciate a way of archiving the profile page. The best I have in that regard is my browser's built-in screenshot feature, which I use for full screen screenshots (I don't know shit about Google Chrome, but Mozilla Firefox and Opera/Opera GX have these. There's probably an extension for this for Google Chrome or any Chromium-based browser, though).

If anyone stumbles on this post and wants to archive Instagram media and stories, I archive Instagram media with this extension (it works fine on my browser, which isn't Google Chrome but can use Chrome extensions). It can save media from an entire profile and any available stories at once, and you can download them as a .zip. I'm not sure about livestreams, though (the extension page says you can view livestreams with it, so perhaps it can save livestreams that are still available to view after they finish).
 

BlancoMailo

True & Honest Fan
kiwifarms.net
While this is bumped I had another question. The archive.md guy is apparently very unstable.
Are there any contingency plans for important cow content in the event that site ever folds?
I know Null has enough shit going on with targets on his back, but perhaps a kiwifarms-owned archiving service to back up thread OP links (text only if needed for space reasons) would be a good future-proofing strategy? I'm aware people outside the farms will claim the content is faked but what else is new.
I know it's not going to happen but I wish @Null had the resources for it. The archive.md just blocked Brave again as the feud continues:
Again....png

Update, (8/11/20):
"aJfPgX0.png
https://i.imgur.com/aJfPgX0.png

Aryanization as it is - stealing about $1500 (voluntary donations of website users) explaining this with discrediting on a national basis

UPDATE:
tumblr_654452d40eaf91d6fe227291e6196cfd_5df0aec4_500.png
tumblr_da48868c2fbb38a54a746570c84b7f3e_86158f5c_540.png"

 

Gustav Schuchardt

Trans exclusionary radical feminazi.
True & Honest Fan
kiwifarms.net
For this post I need to cut the video from 9m15s to the end.

Given ffmpeg fucks up cuts in copy mode, I thought I'd try MP4Box.

You can do this like this - calculate the time in seconds using a calculator. For some reason I had to rename the MP4 file to In.mp4 because MP4Box didn't like "Sargon Fails, Homosexual Voltron, FBI robot! Is the Soyless Matt Show #7-lwbFHph0f0Y-360p.mp4"

MP4Box -add In.mp4 -splitx 555:end -new Cut.mp4

or like this, make Bash do it

MP4Box -add In.mp4 -splitx $((9*60+15)):end -new Cut.mp4
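
If you're not in Bash, the same arithmetic is easy to do in Python. This is just a sketch; to_seconds is a made-up helper name, and the formats it accepts ('9m15s' style and 'HH:MM:SS' style) are my own choice, not anything MP4Box requires:

```python
import re


def to_seconds(stamp):
    """Convert '9m15s', '1h2m3s', 'MM:SS' or 'HH:MM:SS' into whole seconds."""
    if ":" in stamp:
        total = 0
        # Each colon-separated field is worth 60 of the next one.
        for part in stamp.split(":"):
            total = total * 60 + int(part)
        return total
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", stamp)
    if not match or not stamp:
        raise ValueError("unrecognised timestamp: " + stamp)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```

Both to_seconds("9m15s") and to_seconds("00:09:15") give 555, the same number used in the MP4Box commands above.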

It's quick but it's still got a weird glitch at the beginning where there's audio but no video. I.e. it gets it wrong in the same way ffmpeg does.

I still can't figure out a tool that will round the start back to the nearest keyframe and the end forward to the nearest keyframe and then do a lossless cut without leaving weird sound without video. If anyone knows, post here.

Right now the only way I know to do it is to reencode with ffmpeg which seems a bit wasteful.

ffmpeg -ss 00:09:15 -i Sargon\ Fails\,\ Homosexual\ Voltron\,\ FBI\ robot\!\ Is\ the\ Soyless\ Matt\ Show\ #7-lwbFHph0f0Y-360p.mp4 -c:v libx264 -preset medium -crf 23 -c:a copy Cut.mp4
 

Gustav Schuchardt

Trans exclusionary radical feminazi.
True & Honest Fan
kiwifarms.net
That's a mobile app, right? I'm looking for a command-line tool I can run on Mac and Windows. E.g. this MP4Splitter tool does it, and is cross-platform:

https://sourceforge.net/projects/mp4joiner/files/MP4Tools/3.8/

It's a bit clunky though and it's a GUI app. Actually, if I could dump all the I frame times, i.e. the start of all GOPs I could probably just use the copy codec with ffmpeg at GOP boundaries and it would work.

I worked out how to dump the i-frame time codes

https://superuser.com/questions/554...keyframe-before-a-given-timestamp-with-ffmpeg

Code:
ffprobe -select_streams v -skip_frame nokey -show_frames         -show_entries frame=pkt_pts_time,pict_type Sargon\ Fails\,\ Homosexual\ Voltron\,\ FBI\ robot\!\ Is\ the\ Soyless\ Matt\ Show\ #7-lwbFHph0f0Y-360p.mp4 > iframes.txt
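
Once you have iframes.txt, picking the last keyframe at or before your cut point can be sketched like this. It assumes ffprobe wrote its default text format, with one "pkt_pts_time=<seconds>" line per frame; if your ffprobe build names the field differently, the regex needs adjusting. The function names are made up for illustration:

```python
import bisect
import re


def keyframe_times(ffprobe_text):
    """Pull every '...pts_time=<seconds>' value out of ffprobe's text output."""
    found = re.findall(r"pts_time=([0-9]+(?:\.[0-9]+)?)", ffprobe_text)
    return sorted(float(t) for t in found)


def keyframe_at_or_before(times, target):
    """Return the latest keyframe time that is <= target (times must be sorted)."""
    index = bisect.bisect_right(times, target)
    if index == 0:
        raise ValueError("no keyframe at or before %s" % target)
    return times[index - 1]
```

You would call it along the lines of keyframe_at_or_before(keyframe_times(open("iframes.txt").read()), 555) and feed the result to whatever does the cut.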
If I try and cut at those boundaries MP4Box still tries to adjust to 'the nearest random access point'.

Code:
$ MP4Box -add In.mp4 -splitx  552.552000:end -new Cut.mp4
IsoMedia import In.mp4 - track ID 1 - Video (size 640 x 360)
IsoMedia import In.mp4 - track ID 2 - Audio (SR 44100 - 2 channels)
Adjusting chunk start time to previous random access at 552.55 sec
Extracting chunk Cut_552_2535.mp4 - duration 1982.71s (552.55s->2535.26s)
Warning: Edit list doesn't look like a track delay scheme - ignoring
Warning: Edit list doesn't look like a track delay scheme - ignoring
It turns out that not all I-frames are random access points as explained cryptically here

https://forum.doom9.org/showthread.php?p=1502905#post1502905

MP4Box doesn't mark non-IDR I-frames as random access points because 14496-15 doesn't allow marking them as sync samples.
Seeking of H.264/AVC stream with Open-GOP in MP4 is incomplete on its spec currently.
The solution of this will be defined in 14496-12:2008/Amd.3 (under 'iso6' brand).
So I recommend you should avoid AVC-in-MP4 with Open-GOP at present.
Eh, I'll need to do some more research to figure this shit out.

I found this helpful post, on <DAME PESOS VOICE>fucking reddit</DAME PESOS VOICE>

https://snew.notabug.io/r/ffmpeg/comments/6p4i4g/keyframe_issues_with_concat/
https://archive.vn/wip/yfmF7

ffmpeg -ss X -i input.mp4 -c copy -t X -avoid_negative_ts make_zero output.mp4

Testing here it seems to work in VLC

ffmpeg -ss 00:09:15 -i Sargon\ Fails\,\ Homosexual\ Voltron\,\ FBI\ robot\!\ Is\ the\ Soyless\ Matt\ Show\ #7-lwbFHph0f0Y-360p.mp4 -c copy -avoid_negative_ts make_zero Out.mp4
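
If all that backslash escaping in the filename is a pain, one option is to launch ffmpeg from Python with an argument list, so the filename is passed as-is. This is a sketch using the same flags as the command above; copy_cut_command and run_cut are made-up names, and it assumes ffmpeg is on your PATH:

```python
import subprocess


def copy_cut_command(source, start, output):
    """Build the argument list for a stream-copy cut from `start` to the end."""
    return [
        "ffmpeg",
        "-ss", start,          # seek before opening the input
        "-i", source,          # passed as one argument, so no backslash escaping
        "-c", "copy",          # copy streams instead of re-encoding
        "-avoid_negative_ts", "make_zero",
        output,
    ]


def run_cut(source, start, output):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(copy_cut_command(source, start, output), check=True)
```

This only changes how the command is launched. The resulting file is the same one the shell command produces, quirks included.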



However the post notes, ominously, that

If you cut on a non-keyframe, all earlier frames upto the keyframe are still included in the file but with negative timestamps. A good player will use those frames to decode frames from the cut point but not display these preceding frames. During concat, ffmpeg drops these frames because their offset timestamps (assigned timestamps within the concat output file) clash with the ending frames of the preceding input. So, the initial part of the following file can no longer be decoded and played.
This makes me think bad players will choke horribly because they can barely manage to decode normal H.264 streams and will certainly fail to grok the 'I've put all these frames in but set the time stamp to zero so just skip them' trick. And what do you know, Chrome on MacOS plays sound with no video for the first few seconds. Fuck this gay Earth.

If you have a Mac, it turns out that QuickLook has something called QuickTrim, which is mentioned here:

https://news.ycombinator.com/item?id=22775502
https://archive.vn/HaUng
Tried it here and it works. I can see why audiovisual professionals buy Macs, to be honest. You can cut the video by pressing space, dragging a slider, and then clicking Done. You don't need to know any propeller headed bullshit about keyframes, iframes, GOPS, or RATs.

On Windows, I bet something like VirtualDub would do it.
 

Gustav Schuchardt

Trans exclusionary radical feminazi.
True & Honest Fan
kiwifarms.net
Here's an interesting thing. Consider this video

https://www.youtube.com/watch?v=FPy2OHfLJc8

It is age-restricted. This article explains how to set up a cookies.txt file using a Chrome or Chromium extension

https://daveparrish.net/posts/2018-06-22-How-to-download-private-YouTube-videos-with-youtube-dl.html
https://archive.vn/rDsGA

I found if I grabbed the cookies from youtube.com and google.com and pasted them into ~/cookies.txt I could download the video like this

youtube-dl --cookies=~/cookies.txt -f 134+140 https://www.youtube.com/watch?v=FPy2OHfLJc8

Obviously you should be very careful with cookies.txt because any tool that has it can use it to impersonate you on Google or Youtube.
 

BlancoMailo

True & Honest Fan
kiwifarms.net
Seriously, if anyone happens to know of an alternative as functional as archive.md (outside of the wayback machine), it'd be great. The only other one I'm currently aware of is https://archive.st/ and it doesn't work with instagram.

The archive.md guy's going off the deep end responding to anons about Brave being racist and being happy to limit usage from 'toxic' websites.
1.png

2.png


Mind you, this is apparently after they already paid him the money. He doesn't seem to understand that all of the things he's bitching about are the direct result of the USA PATRIOT Act and not Brave having random racial hatred for a group of nations that just happen to be the global hubs of money laundering and cryptoscams.
 

Gustav Schuchardt

Trans exclusionary radical feminazi.
True & Honest Fan
kiwifarms.net
Consider this video

https://www.youtube.com/watch?v=xON7DhhGI18

youtube-dl (and thus ytdl) fails like this.

Code:
$ youtube-dl --cookies=~/cookies.txt -F xON7DhhGI18
[youtube] xON7DhhGI18: Downloading webpage
ERROR: xON7DhhGI18: YouTube said: Unable to extract video data
No magic invocations of --cookies or --user-agent seemed to help.

I think what's going on here is that the video is age-restricted and it wants a sign in. E.g. if you look here.

https://archive.vn/aSBAe
1599142115372.png

Now I can view this fine from Chrome so I resorted to this extension

https://chrome.google.com/webstore/detail/youtube-video-downloader/gjndphdopaigpbbhdlgphjgfccacnbja

This seems to be able to download it just fine, copy here. The extension is just a wrapper around this website:

https://downverse.com/

As to why Downverse seems to be able to download I've no idea. Presumably, it manages to get past the age block by having better cookies than me. Downverse is pretty civilized - you can choose what resolution you want. So if youtube-dl/ytdl fails give the extension a try or use Downverse directly. I sort of suspect the video is age-restricted and unlisted and double-secret shadowbanned by Google, to be honest, and youtube-dl can't handle that. Needless to say, this is exactly the sort of thing that needs to be archived.
 

anameisaname

I'm going to unlock all the achievements.
kiwifarms.net
The existing tools can be found here, along with some discussion of the archival issue.

Github
1599086819321.png
Issues with Brave


Here's hoping generous Josh continues to deliver. I hope the tech volk can add to this thread.
Saint Jersh.PNG


Here are some archival tools:

pkg install python
pip install youtube-dl
youtube-dl --version
How to download age-restricted YouTube videos using youtube-dl

Log in to YouTube using your browser
After logging in, export the cookies into a cookies.txt file (Netscape format)
Add --cookies /path/to/cookies.txt to your youtube-dl command
You can now download age-restricted videos on YouTube
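
Before pointing youtube-dl at the file, you can check that the export actually produced a readable Netscape-format cookies.txt. This is a sketch using Python's standard http.cookiejar module; cookie_domains is a made-up helper name:

```python
from http.cookiejar import MozillaCookieJar


def cookie_domains(path):
    """Load a Netscape-format cookies.txt and list the domains it covers."""
    jar = MozillaCookieJar()
    # Keep session and expired cookies too; this is only checking the file parses.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return sorted({cookie.domain for cookie in jar})
```

If the file is not in Netscape format, load() raises an error instead of returning an empty list, which tells you the export went wrong. Treat the file like a password: anything that can read it can act as you.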


You can export the cookies using an extension or something
Firefox: https://addons.mozilla.org/en-US/firefox/addon/export-cookies-txt/
Chrome: https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg

Screenshot_2020-08-13-19-22-07-1.png
 

Ask Jeeves

True & Honest Fan
kiwifarms.net
The main problem with most of those is that they can be tampered with. That's the reason a 3rd party archival server is so important and probably why most of these aren't listed in the archive everything thread.

Also, some of these tools have had really detailed guides written by kiwis, if you want to organize them in one location.
 

anameisaname

I'm going to unlock all the achievements.
kiwifarms.net
Thanks but Prospering Grounds is for lolcow threads. There is already a thread for Archival Tools. I meant someone should make a thread for this Brave thing as like a current event. :)
Yeah, I guess I wasn't descriptive enough. The archival tools thread is great, but this was meant to be a general discussion thread covering everything from archive to Brave, etc.
:heart-empty:
 

Blondie

Forever young, I want to be forever young.
kiwifarms.net
What a fucking chad, honestly with Josh's recent developments in destroying a minecraft server, blaming all of us for it going wrong, and everything, I doubt he'll make a replacement for archive.md
 

ducktales4gameboy

archive what you want to remember
kiwifarms.net
Does anyone know if there's a crawler that can capture all of the notes on a tumblr blog (which are usually streamed in batches as you scroll through the notes list)? There have been several incredibly inflammatory posts floating around in the last week or so with tons of salty replies, and the usual archive methods fail due to the aforementioned streaming. Trying to hit the API for the notes doesn't seem to work either, because it's been fucked and malfunctioning for a month or so now.
 