46 Commits

Author SHA1 Message Date
Adrian Malacoda
05c766011f style and convention fixes to make pylint happy 2016-12-16 00:29:59 -06:00
Adrian Malacoda
1db0d315b8 add lint and pyinstaller targets 2016-12-15 23:49:42 -06:00
Adrian Malacoda
e574207656 move venv logic to makefile 2016-12-15 23:14:42 -06:00
Adrian Malacoda
9765675925 use python3 -m venv to create virtualenv 2016-11-28 03:07:31 -06:00
Adrian Malacoda
93db0de79f Merge branch 'master' of gitlab.monarch-pass.net:malacoda/the-great-escape 2016-11-28 02:59:16 -06:00
Adrian Malacoda
8c1f8a7887 add a script that can set up virtualenv if it doesn't exist, and run tge in virtualenv 2016-11-28 02:58:50 -06:00
Adrian Malacoda
e93f2ba574 should probably sanitize category title as well 2016-11-27 21:15:39 -06:00
Adrian Malacoda
347a50bf6e need to quote key 2016-11-27 21:14:49 -06:00
Adrian Malacoda
77775ae0be need to pass url to scrape_board_from_document 2016-11-27 17:52:47 -06:00
Adrian Malacoda
acdf659e4a dry up pagination logic using a generator 2016-11-27 17:50:25 -06:00
Adrian Malacoda
4cd9b22eb9 Use a loop to iterate thread/board pages, not recursion. For large threads this can cause a stack overflow. Also, since we're no longer doing the http request in the same function that does the scraping, we need to limit the @retry to the function that actually does the http call as that's what we want to be retrying. 2016-11-27 17:42:46 -06:00
Adrian Malacoda
b67ab06b55 Add exponential backoff for retrying 2016-11-27 13:03:13 -06:00
Adrian Malacoda
f47895fd46 add retrying dependency 2016-11-27 13:01:20 -06:00
Adrian Malacoda
83802088bf url -> in 2016-11-27 02:07:21 -06:00
Adrian Malacoda
6e184478e0 commas and quotes too 2016-11-27 02:06:38 -06:00
Adrian Malacoda
a517f5c28c sanitize some more characters. Not all of these might be unsafe but some are at least weird looking in filenames 2016-11-27 02:06:07 -06:00
Adrian Malacoda
bb19533df6 __sanitize_title -> sanitize_title 2016-11-27 02:04:54 -06:00
Adrian Malacoda
c52f472091 redo main() so it can work with either the local filesystem or urls. rename --url to --in for consistency. 2016-11-27 02:04:44 -06:00
Adrian Malacoda
808677b327 split the sanitize_title into a util module 2016-11-27 02:04:20 -06:00
Adrian Malacoda
ea6b0ddb1f need to write in binary mode 2016-11-27 01:58:47 -06:00
Adrian Malacoda
aae663d518 add pickle scraper which can load serialized pickle files 2016-11-27 01:58:39 -06:00
Adrian Malacoda
dda2f183e3 if can_scrape_url isn't available don't call it 2016-11-27 01:58:25 -06:00
Adrian Malacoda
30aeead404 add import 2016-11-27 01:52:45 -06:00
Adrian Malacoda
43f1f6a680 add pickle outputter 2016-11-27 01:42:59 -06:00
Adrian Malacoda
ac76474030 need to clean up title by removing certain characters 2016-11-27 01:41:17 -06:00
Adrian Malacoda
2e93aaa9bc want to output board description 2016-11-27 01:21:07 -06:00
Adrian Malacoda
d54f3ec21c there's multiple h1's on the page and the one we want is like .eq(2) or something. But once you start addressing nodes by index like that you get real brittle and can break easily. I don't think we have a problem with just selecting all h1's here. 2016-11-27 01:19:59 -06:00
Adrian Malacoda
39b8bfff30 grab board description from forum index (we can't get it from the board index) 2016-11-27 01:13:45 -06:00
Adrian Malacoda
3f4eecc238 use dateutil to parse rfc3339 datetime strings in <time> elements, if they are present. 2016-11-27 01:10:04 -06:00
Adrian Malacoda
c800312423 add python-dateutil dep 2016-11-27 01:01:52 -06:00
Adrian Malacoda
5bcb6e8884 add extra post & user info 2016-11-27 00:48:55 -06:00
Adrian Malacoda
71a4f8c5a4 add extra user metadata such as title and avatar 2016-11-27 00:43:11 -06:00
Adrian Malacoda
de89ddb350 Add timestamp to post model 2016-11-27 00:42:27 -06:00
Adrian Malacoda
61e25fe9d9 example of large thread 2016-11-27 00:34:59 -06:00
Adrian Malacoda
c83d4a9916 for now, limit to forumer forums (fr.yuku.com) as I'm not sure if this scraper will support non-forumer ones 2016-11-27 00:18:39 -06:00
Adrian Malacoda
55176e4596 more examples in readme 2016-11-27 00:17:04 -06:00
Adrian Malacoda
741573d30a only want first h1/h2 etc 2016-11-27 00:16:21 -06:00
Adrian Malacoda
ea46ae8853 .text() not text 2016-11-27 00:14:16 -06:00
Adrian Malacoda
9c401cbfb1 need to use .items() grumble grumble 2016-11-27 00:11:42 -06:00
Adrian Malacoda
b304297019 fix signature parsing, use html instead of text. Unfortunately there's a lot of garbage here we'll have to clean up 2016-11-27 00:03:30 -06:00
Adrian Malacoda
6fb7980218 make threads subdir under board so we can put an index.json there with board metadata 2016-11-26 23:54:05 -06:00
Adrian Malacoda
eabf099f47 fix for yuku's broken postbit markup 2016-11-26 23:42:30 -06:00
Adrian Malacoda
f4540d4030 expand readme 2016-11-26 23:16:06 -06:00
Adrian Malacoda
c04c030540 add user object 2016-11-26 23:14:09 -06:00
Adrian Malacoda
933e178ce5 initial commit for the-great-escape yuku scraper 2016-11-26 23:09:12 -06:00
Adrian Malacoda
e5fb7e5c9a initial commit 2016-11-26 21:02:28 -06:00
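
Several of the commits above (4cd9b22eb9, acdf659e4a, b67ab06b55) describe the same idea: walk thread/board pages with a loop wrapped in a generator instead of recursing, and retry only the function that makes the HTTP call, with exponential backoff. The following is a minimal sketch of that pattern, not the repository's actual code. The log mentions the `retrying` dependency; to stay self-contained this sketch swaps in a hand-rolled, stdlib-only decorator. All names here (`retry_with_backoff`, `iterate_pages`, `fake_fetch`, `FAKE_SITE`) and the in-memory "site" are hypothetical.

```python
import functools
import time


def retry_with_backoff(attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry the wrapped call, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except IOError:
                    if attempt == attempts - 1:
                        raise  # out of attempts, surface the error
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


def iterate_pages(fetch_page, first_url):
    """Yield each page, following "next" links with a loop, not recursion."""
    url = first_url
    while url is not None:
        page = fetch_page(url)  # only this call is retried
        yield page
        url = page.get("next")


# Hypothetical in-memory stand-in for a paginated thread.
FAKE_SITE = {
    "/thread?page=1": {"posts": ["a", "b"], "next": "/thread?page=2"},
    "/thread?page=2": {"posts": ["c"], "next": None},
}
_failed_once = set()


@retry_with_backoff(attempts=3, base_delay=0.0)
def fake_fetch(url):
    # Fail the first request for each URL to exercise the retry path.
    if url not in _failed_once:
        _failed_once.add(url)
        raise IOError("simulated transient failure")
    return FAKE_SITE[url]


def collect_posts(first_url):
    """Gather posts from every page of the (fake) thread."""
    posts = []
    for page in iterate_pages(fake_fetch, first_url):
        posts.extend(page["posts"])
    return posts
```

Because the retry decorator wraps only the fetch function, a transient failure on one page re-requests that page alone rather than restarting the whole scrape, and the loop keeps stack depth constant however long the thread is.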