the-great-escape

Author	SHA1	Message	Date
Adrian Malacoda	05c766011f	style and convention fixes to make pylint happy	2016-12-16 00:29:59 -06:00
Adrian Malacoda	1db0d315b8	add lint and pyinstaller targets	2016-12-15 23:49:42 -06:00
Adrian Malacoda	e574207656	move venv logic to makefile	2016-12-15 23:14:42 -06:00
Adrian Malacoda	9765675925	use python3 -m venv to create virtualenv	2016-11-28 03:07:31 -06:00
Adrian Malacoda	93db0de79f	Merge branch 'master' of gitlab.monarch-pass.net:malacoda/the-great-escape	2016-11-28 02:59:16 -06:00
Adrian Malacoda	8c1f8a7887	add a script that can set up virtualenv if it doesn't exist, and run tge in virtualenv	2016-11-28 02:58:50 -06:00
Adrian Malacoda	e93f2ba574	should probably sanitize category title as well	2016-11-27 21:15:39 -06:00
Adrian Malacoda	347a50bf6e	need to quote key	2016-11-27 21:14:49 -06:00
Adrian Malacoda	77775ae0be	need to pass url to scrape_board_from_document	2016-11-27 17:52:47 -06:00
Adrian Malacoda	acdf659e4a	dry up pagination logic using a generator	2016-11-27 17:50:25 -06:00
Adrian Malacoda	4cd9b22eb9	Use a loop to iterate thread/board pages, not recursion. For large threads this can cause a stack overflow. Also, since we're no longer doing the http request in the same function that does the scraping, we need to limit the @retry to the function that actually does the http call as that's what we want to be retrying.	2016-11-27 17:42:46 -06:00
Adrian Malacoda	b67ab06b55	Add exponential backoff for retrying	2016-11-27 13:03:13 -06:00
Adrian Malacoda	f47895fd46	add retrying dependency	2016-11-27 13:01:20 -06:00
Adrian Malacoda	83802088bf	url -> in	2016-11-27 02:07:21 -06:00
Adrian Malacoda	6e184478e0	commas and quotes too	2016-11-27 02:06:38 -06:00
Adrian Malacoda	a517f5c28c	sanitize some more characters. Not all of these might be unsafe but some are at least weird looking in filenames	2016-11-27 02:06:07 -06:00
Adrian Malacoda	bb19533df6	__sanitize_title -> sanitize_title	2016-11-27 02:04:54 -06:00
Adrian Malacoda	c52f472091	redo main() so it can work with either the local filesystem or urls. rename --url to --in for consistency.	2016-11-27 02:04:44 -06:00
Adrian Malacoda	808677b327	split the sanitize_title into a util module	2016-11-27 02:04:20 -06:00
Adrian Malacoda	ea6b0ddb1f	need to write in binary mode	2016-11-27 01:58:47 -06:00
Adrian Malacoda	aae663d518	add pickle scraper which can load serialized pickle files	2016-11-27 01:58:39 -06:00
Adrian Malacoda	dda2f183e3	if can_scrape_url isn't available don't call it	2016-11-27 01:58:25 -06:00
Adrian Malacoda	30aeead404	add import	2016-11-27 01:52:45 -06:00
Adrian Malacoda	43f1f6a680	add pickle outputter	2016-11-27 01:42:59 -06:00
Adrian Malacoda	ac76474030	need to clean up title by removing certain characters	2016-11-27 01:41:17 -06:00
Adrian Malacoda	2e93aaa9bc	want to output board description	2016-11-27 01:21:07 -06:00
Adrian Malacoda	d54f3ec21c	there's multiple h1's on the page and the one we want is like .eq(2) or something. But once you start addressing nodes by index like that you get real brittle and can break easily. I don't think we have a problem with just selecting all h1's here.	2016-11-27 01:19:59 -06:00
Adrian Malacoda	39b8bfff30	grab board description from forum index (we can't get it from the board index)	2016-11-27 01:13:45 -06:00
Adrian Malacoda	3f4eecc238	use dateutil to parse rfc3339 datetime strings in <time> elements, if they are present.	2016-11-27 01:10:04 -06:00
Adrian Malacoda	c800312423	add python-dateutil dep	2016-11-27 01:01:52 -06:00
Adrian Malacoda	5bcb6e8884	add extra post & user info	2016-11-27 00:48:55 -06:00
Adrian Malacoda	71a4f8c5a4	add extra user metadata such as title and avatar	2016-11-27 00:43:11 -06:00
Adrian Malacoda	de89ddb350	Add timestamp to post model	2016-11-27 00:42:27 -06:00
Adrian Malacoda	61e25fe9d9	example of large thread	2016-11-27 00:34:59 -06:00
Adrian Malacoda	c83d4a9916	for now, limit to forumer forums (fr.yuku.com) as I'm not sure if this scraper will support non-forumer ones	2016-11-27 00:18:39 -06:00
Adrian Malacoda	55176e4596	more examples in readme	2016-11-27 00:17:04 -06:00
Adrian Malacoda	741573d30a	only want first h1/h2 etc	2016-11-27 00:16:21 -06:00
Adrian Malacoda	ea46ae8853	.text() not text	2016-11-27 00:14:16 -06:00
Adrian Malacoda	9c401cbfb1	need to use .items() grumble grumble	2016-11-27 00:11:42 -06:00
Adrian Malacoda	b304297019	fix signature parsing, use html instead of text. Unfortunately there's a lot of garbage here we'll have to clean up	2016-11-27 00:03:30 -06:00
Adrian Malacoda	6fb7980218	make threads subdir under board so we can put an index.json there with board metadata	2016-11-26 23:54:05 -06:00
Adrian Malacoda	eabf099f47	fix for yuku's broken postbit markup	2016-11-26 23:42:30 -06:00
Adrian Malacoda	f4540d4030	expand readme	2016-11-26 23:16:06 -06:00
Adrian Malacoda	c04c030540	add user object	2016-11-26 23:14:09 -06:00
Adrian Malacoda	933e178ce5	initial commit for the-great-escape yuku scraper	2016-11-26 23:09:12 -06:00
Adrian Malacoda	e5fb7e5c9a	initial commit	2016-11-26 21:02:28 -06:00

46 Commits
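
Commits 4cd9b22eb9, acdf659e4a and b67ab06b55 describe walking thread/board pages with a loop and a generator instead of recursion, and retrying only the HTTP call with exponential backoff. A minimal sketch of that idea follows. It is not the repository's code: the repository depends on the `retrying` package, while this sketch uses only the standard library, and the names `fetch_with_backoff`, `iter_pages` and the `fetch` callable (assumed to return the page body plus the next page's URL, or None on the last page) are hypothetical.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

# Assumed shape of a fetched page: (page body, URL of the next page or None).
Page = Tuple[str, Optional[str]]


def fetch_with_backoff(fetch: Callable[[str], Page], url: str,
                       attempts: int = 5, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Page:
    """Call fetch(url), retrying on IOError with exponentially growing waits."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
    raise ValueError("attempts must be at least 1")


def iter_pages(fetch: Callable[[str], Page], start_url: str,
               **retry_options) -> Iterator[str]:
    """Yield each page body in turn, following next-page links in a loop."""
    url: Optional[str] = start_url
    while url is not None:
        # Only the network call is retried; scraping the body is the
        # caller's job and happens outside the retry.
        body, url = fetch_with_backoff(fetch, url, **retry_options)
        yield body
```

Because `iter_pages` is a plain loop, the stack depth stays constant however many pages a thread has, and a caller can simply write `for body in iter_pages(my_fetch, first_url): ...` and scrape each body as it arrives.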