Commit Graph

  • 05c766011f style and convention fixes to make pylint happy (master) Adrian Malacoda 2016-12-16 00:29:59 -06:00
  • 1db0d315b8 add lint and pyinstaller targets Adrian Malacoda 2016-12-15 23:49:42 -06:00
  • e574207656 move venv logic to makefile Adrian Malacoda 2016-12-15 23:14:42 -06:00
  • 9765675925 use python3 -m venv to create virtualenv Adrian Malacoda 2016-11-28 03:07:31 -06:00
  • 93db0de79f Merge branch 'master' of gitlab.monarch-pass.net:malacoda/the-great-escape Adrian Malacoda 2016-11-28 02:59:16 -06:00
  • 8c1f8a7887 add a script that can set up virtualenv if it doesn't exist, and run tge in virtualenv Adrian Malacoda 2016-11-28 02:58:50 -06:00
  • e93f2ba574 should probably sanitize category title as well Adrian Malacoda 2016-11-27 21:15:39 -06:00
  • 347a50bf6e need to quote key Adrian Malacoda 2016-11-27 21:14:49 -06:00
  • 77775ae0be need to pass url to scrape_board_from_document Adrian Malacoda 2016-11-27 17:52:47 -06:00
  • acdf659e4a dry up pagination logic using a generator Adrian Malacoda 2016-11-27 17:50:25 -06:00
  • 4cd9b22eb9 Use a loop to iterate thread/board pages, not recursion. For large threads this can cause a stack overflow. Also, since we're no longer doing the http request in the same function that does the scraping, we need to limit the @retry to the function that actually does the http call as that's what we want to be retrying. Adrian Malacoda 2016-11-27 17:42:46 -06:00
  • b67ab06b55 Add exponential backoff for retrying Adrian Malacoda 2016-11-27 13:03:13 -06:00
  • f47895fd46 add retrying dependency Adrian Malacoda 2016-11-27 13:01:20 -06:00
  • 83802088bf url -> in Adrian Malacoda 2016-11-27 02:07:21 -06:00
  • 6e184478e0 commas and quotes too Adrian Malacoda 2016-11-27 02:06:38 -06:00
  • a517f5c28c sanitize some more characters. Not all of these might be unsafe but some are at least weird looking in filenames Adrian Malacoda 2016-11-27 02:06:07 -06:00
  • bb19533df6 __sanitize_title -> sanitize_title Adrian Malacoda 2016-11-27 02:04:54 -06:00
  • c52f472091 redo main() so it can work with either the local filesystem or urls. rename --url to --in for consistency. Adrian Malacoda 2016-11-27 02:04:44 -06:00
  • 808677b327 split the sanitize_title into a util module Adrian Malacoda 2016-11-27 02:04:20 -06:00
  • ea6b0ddb1f need to write in binary mode Adrian Malacoda 2016-11-27 01:58:47 -06:00
  • aae663d518 add pickle scraper which can load serialized pickle files Adrian Malacoda 2016-11-27 01:58:39 -06:00
  • dda2f183e3 if can_scrape_url isn't available don't call it Adrian Malacoda 2016-11-27 01:58:25 -06:00
  • 30aeead404 add import Adrian Malacoda 2016-11-27 01:52:45 -06:00
  • 43f1f6a680 add pickle outputter Adrian Malacoda 2016-11-27 01:42:59 -06:00
  • ac76474030 need to clean up title by removing certain characters Adrian Malacoda 2016-11-27 01:41:17 -06:00
  • 2e93aaa9bc want to output board description Adrian Malacoda 2016-11-27 01:21:07 -06:00
  • d54f3ec21c there's multiple h1's on the page and the one we want is like .eq(2) or something. But once you start addressing nodes by index like that you get real brittle and can break easily. I don't think we have a problem with just selecting all h1's here. Adrian Malacoda 2016-11-27 01:19:59 -06:00
  • 39b8bfff30 grab board description from forum index (we can't get it from the board index) Adrian Malacoda 2016-11-27 01:13:45 -06:00
  • 3f4eecc238 use dateutil to parse rfc3339 datetime strings in <time> elements, if they are present. Adrian Malacoda 2016-11-27 01:10:04 -06:00
  • c800312423 add python-dateutil dep Adrian Malacoda 2016-11-27 01:01:52 -06:00
  • 5bcb6e8884 add extra post & user info Adrian Malacoda 2016-11-27 00:48:55 -06:00
  • 71a4f8c5a4 add extra user metadata such as title and avatar Adrian Malacoda 2016-11-27 00:43:11 -06:00
  • de89ddb350 Add timestamp to post model Adrian Malacoda 2016-11-27 00:42:27 -06:00
  • 61e25fe9d9 example of large thread Adrian Malacoda 2016-11-27 00:34:59 -06:00
  • c83d4a9916 for now, limit to forumer forums (fr.yuku.com) as I'm not sure if this scraper will support non-forumer ones Adrian Malacoda 2016-11-27 00:18:39 -06:00
  • 55176e4596 more examples in readme Adrian Malacoda 2016-11-27 00:17:04 -06:00
  • 741573d30a only want first h1/h2 etc Adrian Malacoda 2016-11-27 00:16:21 -06:00
  • ea46ae8853 .text() not text Adrian Malacoda 2016-11-27 00:14:16 -06:00
  • 9c401cbfb1 need to use .items() grumble grumble Adrian Malacoda 2016-11-27 00:11:42 -06:00
  • b304297019 fix signature parsing, use html instead of text. Unfortunately there's a lot of garbage here we'll have to clean up Adrian Malacoda 2016-11-27 00:03:30 -06:00
  • 6fb7980218 make threads subdir under board so we can put an index.json there with board metadata Adrian Malacoda 2016-11-26 23:54:05 -06:00
  • eabf099f47 fix for yuku's broken postbit markup Adrian Malacoda 2016-11-26 23:42:30 -06:00
  • f4540d4030 expand readme Adrian Malacoda 2016-11-26 23:16:06 -06:00
  • c04c030540 add user object Adrian Malacoda 2016-11-26 23:14:09 -06:00
  • 933e178ce5 initial commit for the-great-escape yuku scraper Adrian Malacoda 2016-11-26 23:09:12 -06:00
  • e5fb7e5c9a initial commit Adrian Malacoda 2016-11-26 21:02:28 -06:00
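
The pagination and retry changes described in commits 4cd9b22eb9, acdf659e4a and b67ab06b55 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code. The log says the project depends on the retrying package, whereas this sketch substitutes a small hand-written decorator so that it runs with the standard library alone. The names retry_with_backoff, fetch_page, iter_pages and FAKE_PAGES are hypothetical, and the in-memory dictionary stands in for real HTTP requests.

```python
import functools
import time


def retry_with_backoff(attempts=3, base_delay=1.0):
    """Retry the wrapped call on IOError, doubling the delay each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except IOError:
                    # Give up after the final attempt.
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator


# Hypothetical stand-in for a forum: url -> (items on page, next page url).
FAKE_PAGES = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": (["d"], None),
}


# Only the function that performs the I/O is retried, not the scraping logic.
# base_delay is 0.0 here purely so the example runs instantly.
@retry_with_backoff(attempts=3, base_delay=0.0)
def fetch_page(url):
    return FAKE_PAGES[url]


def iter_pages(start_url):
    """Yield each page's items, following 'next' links with a loop.

    A loop keeps the call stack flat no matter how many pages a thread
    has, which a recursive version would not.
    """
    url = start_url
    while url is not None:
        items, url = fetch_page(url)
        yield items


all_items = [item for page in iter_pages("page1") for item in page]
```

Because iter_pages is a generator, the same loop can serve both thread and board pagination: each caller only decides what to do with the pages it yields.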