  1. - Top - End - #1
    Ettin in the Playground
     

    Join Date
    Jan 2013

    Default [HTTRACK] Copying a webpage

    Hi all, I would appreciate some help.

    Over the coming months I need to do a lot of travel for work. While that happens I would like to be able to access reference information on some webpages. I have the biggest stuff print-screened and saved, but it's a pain to search through it or to copy and paste from it.

    I have been looking at legal options for making a local (offline) copy of a website. I have found HTTrack, which seems a tested option, and was wondering if anyone has experience with using it. If so, what settings would you recommend taking care of? I have been reading through the FAQ and the guides and I am not certain I understand everything correctly.

    Alternatively has anyone used a different sort of legal program for this purpose?

    Thanks!
    Thanks a lot Gengy for the awesome... just a sec... avatar. :)

  2. - Top - End - #2
    Dwarf in the Playground
    Join Date
    Jan 2021

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by thethird View Post
    Hi all, I would appreciate some help.


    Over the coming months I need to do a lot of travel for work. While that happens I would like to be able to access reference information on some webpages. I have the biggest stuff print-screened and saved, but it's a pain to search through it or to copy and paste from it.

    I have been looking at legal options for making a local (offline) copy of a website. I have found HTTrack, which seems a tested option, and was wondering if anyone has experience with using it. If so, what settings would you recommend taking care of? I have been reading through the FAQ and the guides and I am not certain I understand everything correctly.

    Alternatively has anyone used a different sort of legal program for this purpose?

    Thanks!
    On the technical level I can probably help
    on the legal level I can only inform.

    Technical:
    Settings you will have to fill in before you can start, with suggested values:

    Save location: the folder where you've been keeping the pages you saved by hand.
    Links to skip: ftp* (unless you also want to download every file on the website).
    Project name: the website's name.
    Proxy: probably none, unless you know what a proxy is or use a system-wide or browser-wide VPN.
    I'm available for more questions.
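
    For anyone who prefers the command line over the GUI wizard, the settings above map roughly onto HTTrack's command-line form. The sketch below only builds and prints the command, it does not run it. The site URL and folder are hypothetical, and the exact skip pattern is an assumption worth checking against the HTTrack filter documentation before relying on it.

```python
import shlex


def build_httrack_command(url, output_dir, skip_patterns=()):
    """Return an argument list for an HTTrack mirror of `url`.

    -O sets the folder the mirror is saved to; an argument that starts
    with "-" followed by a pattern is treated as an exclusion filter.
    """
    command = ["httrack", url, "-O", output_dir]
    command += ["-" + pattern for pattern in skip_patterns]
    return command


if __name__ == "__main__":
    # Hypothetical site and folder, used only for illustration.
    cmd = build_httrack_command(
        "https://www.example.com/",
        "mirrors/example",
        skip_patterns=["ftp://*"],  # assumed pattern for skipping FTP links
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

    Keeping the command as a list makes it easy to review before handing it to a shell or to `subprocess.run`.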


    Legally:

    Everyone does it, and once in a while governments pass laws to protect it, but in reality the legal side has always been very unclear, because it all happens on the internet, unless you stick to a specific subset of websites.

    The reasons I don't know whether this violates any laws (and why probably only a copyright lawyer would) are:
    a. I'm not a qualified lawyer in any jurisdiction.
    b. I don't know enough about the situation.
    c. The laws involved are vaguely formulated and interpreted.
    d. I don't know which websites you're talking about (Wikipedia and a few others addressed this issue by granting licenses for all their content).


    Explanation: technically almost everything on a website is under someone's copyright, because copyright arises automatically at the moment of creation and, in many jurisdictions, lasts until 70 years after the author's death.
    Most websites don't grant an explicit license to copy information from them.
    This would mean that everything anyone ever copies from a website is illegal, including things like one-time authentication codes, but many countries have exceptions to copyright law that could apply to your usage.
    Examples of these exceptions are the right to a backup (as long as you don't keep more than one copy), the right to quote, and fair use.
    News publishers have pushed back against Google reusing their content, for example in Australia, where Google threatened in 2021 to withdraw its search engine from the country before reaching deals with publishers.
    Last edited by Smoutwortel; 2023-02-08 at 07:10 PM. Reason: added technical advice
    The closest I get to clear and concise:
    Quote Originally Posted by Justanotherhero View Post
    Interesting read! Thanks for the post!

  3. - Top - End - #3
    Ogre in the Playground
    Join Date
    Aug 2022

    Default Re: [HTTRACK] Copying a webpage

    Er. I have to ask: Why can't you access the web pages while traveling? Are these pages on your work/company's site? Or somewhere else? Or traveling somewhere where some third party site (or your own company's) isn't accessible? Just kinda confused as to "why" you feel the need to do this.

    Putting on my IT/security hat. My first inclination is to say that "no. you should never do this". If some site is inaccessible where you're going and with whatever equipment/account/access you have there, then it's probably inaccessible for a reason. Circumventing that can get you in trouble with a whole host of both company rules and/or national laws (with some potentially very serious consequences).

    What kind of data are we talking about? If it's the equivalent of instructions on how to put your new bookcase together, that's probably fine (but why couldn't you access it remotely anyway?). If we're talking about technical/engineering specifications, or code bases, you could be really heading for trouble. There are some serious export compliance rules involving that stuff, and some serious penalties for violating them.

    Why are you asking this on a forum for a comic strip? This sounds like the kind of questions you should be asking your employer about, given that you are going on a "work trip" and presumably want/need this for that trip. Any business that does business internationally should have rules in place for just this situation, and methods in place for accessing information that is ok to access, and strict prohibitions on accessing stuff you should not. CCI/IP protection is serious business.

    Crossing an international border with a suitcase full of printed out technical specs/docs is espionage. You really really should seek guidance from your employer on this.
    Last edited by gbaji; 2023-02-09 at 08:14 PM.

  4. - Top - End - #4
    Dwarf in the Playground
    Join Date
    Jan 2021

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by gbaji View Post
    Er. I have to ask: Why can't you access the web pages while traveling? Are these pages on your work/company's site? Or somewhere else? Or traveling somewhere where some third party site (or your own company's) isn't accessible? Just kinda confused as to "why" you feel the need to do this.

    Putting on my IT/security hat. My first inclination is to say that "no. you should never do this". If some site is inaccessible where you're going and with whatever equipment/account/access you have there, then it's probably inaccessible for a reason. Circumventing that can get you in trouble with a whole host of both company rules and/or national laws (with some potentially very serious consequences).

    What kind of data are we talking about? If it's the equivalent of instructions on how to put your new bookcase together, that's probably fine (but why couldn't you access it remotely anyway?). If we're talking about technical/engineering specifications, or code bases, you could be really heading for trouble. There are some serious export compliance rules involving that stuff, and some serious penalties for violating them.

    Why are you asking this on a forum for a comic strip? This sounds like the kind of questions you should be asking your employer about, given that you are going on a "work trip" and presumably want/need this for that trip. Any business that does business internationally should have rules in place for just this situation, and methods in place for accessing information that is ok to access, and strict prohibitions on accessing stuff you should not. CCI/IP protection is serious business.

    Crossing an international border with a suitcase full of printed out technical specs/docs is espionage. You really really should seek guidance from your employer on this.
    I read their post as saying the site would be inaccessible because they wouldn't have internet at certain moments (specifically while in transit, which I expect to be by airplane).
    I read it that way because:
    a. Airlines tend to be wary of passengers' wireless signals, because aircraft rely on radio signals themselves, and because they want to sell their own wifi.
    b. Grammatically, they speak of accessing it while traveling.

  5. - Top - End - #5
    Barbarian in the Playground
     
    Planetar

    Join Date
    Feb 2010

    Default Re: [HTTRACK] Copying a webpage

    I'm very new to http, so sorry if this is way off, but why can't you just save an offline copy to your pc?

  6. - Top - End - #6
    Ogre in the Playground
    Join Date
    Aug 2022

    Default Re: [HTTRACK] Copying a webpage

    Yeah. That could be the case. It was unclear whether he meant while physically traveling (as in on a plane say) or "while traveling" (meaning while not at/near the home office). But in either case, downloading whole sections of web sites is problematic.

    The worst CCI thefts/losses occur when someone downloads stuff to their laptop for a conference, meeting, whatever, then loses the laptop (or it gets swiped). It's like one of the biggest no-nos. That can be mitigated via various drive-level encryption methodologies and/or detachable hardware decryption keys (although maybe avoid the really really dumb tiny inserted keys, since they just beg to be left in the USB port at all times, eliminating their security value). You should restrict yourself to maybe your presentation docs (powerpoint, word, excel) or whatever. Technical stuff? Should be left on the servers.

    If there's a legitimate business need to access company data/sites while traveling, there should be a means to do so, safely and securely, and remotely. And sure, this may create a problem while you're actually on a plane or whatever, but do you really need to access that during those time periods? Probably not. I guess it really depends on what you are doing at your destination.

    And IME as someone who's been doing IT work for 30ish years (since before it was even called "IT"), in every case I've ever encountered where someone was certain that they needed to download/print a bunch of stuff for something, it turned out that they really didn't. Or, at the very least, really shouldn't. I have no clue what sort of business the OP is in, so it's hard to make a strong determination, but the most you should ever really actually *need* is whatever you're actually presenting somewhere. I'm just struggling to understand why you'd need something that is normally in a web site/page format in the first place. Again though, I'm coming from the perspective of an engineering/tech company, where data is king. For all I know, the OP is selling vacuums, or teaching a class, or giving a science talk, or whatever.

    In any case though, if there's a business need for specific documents/data to be available to a traveling employee, it's kinda the responsibility of the employer to facilitate that, with whatever security concerns they have with that stuff addressed. The employee deciding, all on his own, to just download/copy stuff without going through the employer seems... problematic.

  7. - Top - End - #7
    Ogre in the Playground
    Join Date
    Aug 2022

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by BaronOfHell View Post
    I'm very new to http, so sorry if this is way off, but why can't you just save an offline copy to your pc?
    If you're only dealing with very simple formatted text, you could do this. Anything more complex, with redirects, includes, embedded scripts, etc, you kinda have to have a web server running locally *and* include everything referenced to on the remote site, and honestly a ton of other stuff as well (I'm not a web admin myself). Any web browser can read and display only very basic stuff. Anything with "whistles and bells" requires pulling the code to execute (and understand) those things from the server hosting the pages you are viewing (which the browser knows how to do, but has to be on the server itself). Which isn't going to be on your PC, unless you install a whole lot of other stuff.

    Not knowing the nature of the pages he's trying to copy turns this into a guessing game though, with too many possible guesses. For a lot of doc style stuff, there may be methods to import them into another, more portable format (like acroread maybe). Some stuff, may be possible to download and make available locally (A vendor I work with has their entire suite of documentation enclosed in a single rpm for example, with a web front end reading it locally). Other things could already be in a format that is readable by other tools (powerpoint, excel, etc). Hard to say.

    I suppose you can always just screenshot stuff and store it as a set of images, but that's not going to have any hierarchical format to it (and good luck copy/pasting anything if needed). And again, depending on what is on those pages, and where you're putting it, and where you are traveling, this may be a serious no-no ranging from "fired from your job" to "spend time in prison". Or could be nothing at all. No clue.

  8. - Top - End - #8
    Troll in the Playground
     

    Join Date
    Aug 2008

    Default Re: [HTTRACK] Copying a webpage

    Erm... If "Print Screen, Save as" is on the table, why not just "File > Print" and set your destination as "Save as PDF?"

    I know it won't preserve hyperlinks and whatnot, but that will keep it copy-pasta capable.
    "Okay, so I'm going to quick draw and dual wield these one-pound caltrops as improvised weapons..."
    ---
    "Oh, hey, look! Blue Eyes Black Lotus!" "Wait what, do you sacrifice a mana to the... Does it like, summon a... What would that card even do!?" "Oh, it's got a four-energy attack. Completely unviable in actual play, so don't worry about it."

  9. - Top - End - #9
    Titan in the Playground
     

    Join Date
    Aug 2009
    Location
    Maryland
    Gender
    Male

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by gbaji View Post
    Yeah. That could be the case. It was unclear whether he meant while physically traveling (as in on a plane say) or "while traveling" (meaning while not at/near the home office). But in either case, downloading whole sections of web sites is problematic.
    Literally every time you visit a website, you are downloading those pages to your computer.

    There is no technical difference between a call for short term use or for long term.
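
    To make that concrete, here is a minimal sketch using only Python's standard library. It performs the same kind of request a browser makes and then simply keeps the bytes in a file. The function name is made up for illustration, and it saves only the single response, not the images, stylesheets, or scripts the page refers to.

```python
import urllib.request
from pathlib import Path


def save_page(url, destination):
    """Fetch `url` and write the raw response bytes to `destination`."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
    path = Path(destination)
    path.write_bytes(body)  # the only step a plain page view would skip
    return path
```

    Calling `save_page` with a page address and a file name such as `faq.html` leaves a copy that can be opened without a connection, which is roughly what a browser's "Save page as" does for the main document.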

    Also, a lot of industries utilize web scraping in non-trivial amounts. This sort of repurposing is fairly accepted in many contexts. Yeah, if you're doing scraping volume so large it introduces performance issues or trying to pass off the content of others as your own, problems will arise, but mere storage of a website isn't really controversial at all.

    If it were, sites like archive.org literally couldn't exist.


    And IME as someone who's been doing IT work for 30ish years (since before it was even called "IT"), in every case I've ever encountered where someone was certain that they needed to download/print a bunch of stuff for something, it turned out that they really didn't.
    I'm a software developer, and hard disagree. It isn't always the best solution, but it sometimes is.

  10. - Top - End - #10
    Ogre in the Playground
    Join Date
    Aug 2022

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by Tyndmyr View Post
    Literally every time you visit a website, you are downloading those pages to your computer.

    There is no technical difference between a call for short term use or for long term.
    Yeah. I mean, if we want to get really technical in terms of how computers read data, that's true of everything. Yet, we still make legal distinctions between reading content and copying it.

    Short term (as in while the software you are running is reading and/or executing some other data) versus long term is massively relevant and is, in actual fact, a major technical difference. One stores it as a temporary data file in a format for use by specific software (usually retained in memory or "temp" filespace), while the other stores it as an actual file on your hard drive. The former disappears the next time a cache clears, or the temp space is overwritten, and (usually) doesn't survive a reboot of the computer. The latter is there until you delete it yourself.

    Go fire up a web browser and read this page. Turn off your network and try again. Fails to load, right? That's the "technical difference", and it's a big one. Sure. As long as you leave your browser window open, it'll still display whatever was on it last. Again. That's "technically" there. But that's not the same as having a copy of the data. That's just your window manager UI displaying stuff.

    I'm assuming he actually wanted to copy the data either in software form as files on his computer to review/access later, or printed out as hardcopy. Again though, I'm not certain. Hence why I asked several questions about the kind of data he's copying, and (most importantly) where that data normally is (on a work server? behind a firewall? public pages?).

    Quote Originally Posted by Tyndmyr View Post
    Also, a lot of industries utilize web scraping in non-trivial amounts. This sort of repurposing is fairly accepted in many contexts. Yeah, if you're doing scraping volume so large it introduces performance issues or trying to pass off the content of others as your own, problems will arise, but mere storage of a website isn't really controversial at all.

    If it were, sites like archive.org literally couldn't exist.
    Yes. I get that. This is why I asked him whether this is work data, or public data. If it's stuff that open to anyone with a browser anywhere on the internet, then if he wants to pull stuff off for reference, it's probably just fine. My only real question is why. He should be able to access it on any internet connected device wherever he's going. Someone mentioned wanting to look at stuff while on a plane, which I guess is valid. I'm not sure how critical that is, nor how much time/effort I'd put into it. But sure.

    My major concern, and where I've seen this bite people, is when it's data on a work server that isn't accessible remotely. And for some reason, he doesn't have whatever tools he needs to access it while on his trip. Or, as I've seen happen, is concerned about network connectivity, maybe the VPN doesn't always work, whatever, so he wants to make sure the (presumably protected from outside access for a reason) data is present directly on his laptop (or printed out maybe). And that's where, depending on the data itself, it could be a real problem.

    And yes. In my experience, most of the time, there are tools for doing this remotely, but the person doesn't actually ask about them, doesn't take the time to install them on his device to correctly and securely access the company servers, and instead just circumvents everything by copying it locally while attached to the company network, for use later. Which is where a lot of data loss occurs.

    There's literally a long standing joke at the company I work at about execs and laptops at tech conferences and the need to babysit them so they don't do something stupid (both when prepping for the trip, and often while there as well).


    Quote Originally Posted by Tyndmyr View Post
    I'm a software developer, and hard disagree. It isn't always the best solution, but it sometimes is.
    The 90s called. They want their sneakernet back.


    Again. Hugely depends on the data itself. But he mentioned web pages (well, http). That's generally going to be documentation/presentation stuff (again, assuming work related and not just that he wants to read fanfic while flying or something). There are usually better and more secure ways of doing that than just copying/duplicating the web page format itself. Doubly so if he's using http as an access portal to other forms of data (and you'd be amazed how many people aren't aware of this).

    Assuming work related stuff (otherwise why bother?), I'll still default to my initial response: Ask your employer. And if you are the employer, then "ask your IT guy(s)". Odds are they already have a solution for this exact situation, just waiting to tell you. Trying to re-invent the wheel usually results in wheels that fall off, or don't roll well.

  11. - Top - End - #11
    Ettin in the Playground
     

    Join Date
    Jan 2013

    Default Re: [HTTRACK] Copying a webpage

    Oh, this thread actually got responses. Sorry everyone, I got sidetracked with work and stuff.

    On some points:
    Quote Originally Posted by Smoutwortel View Post
    On the technical level I can probably help
    Thanks a lot.

    Quote Originally Posted by gbaji View Post
    Er. I have to ask: Why can't you access the web pages while traveling? Are these pages on your work/company's site? Or somewhere else? Or traveling somewhere where some third party site (or your own company's) isn't accessible? Just kinda confused as to "why" you feel the need to do this.

    [...]

    Crossing an international border with a suitcase full of printed out technical specs/docs is espionage. You really really should seek guidance from your employer on this.
    So... this took a dark turn. To summarize, Brexit happened. Before Brexit, European phone plans meant that I had data on my company's phone when going to the UK from the mainland. After Brexit I no longer have reliable access to secure internet on the go, on demand (no connecting to the random wifi at the airport, nor the Starbucks hotspot). Internet is still available from my work phone, and can be used/shared if necessary, but leads to a (billable) surcharge. Internet from my personal phone would also be available, but...

    What I need available is reference documentation that's public but a pain in the backside to navigate. In particular, I work in Regulatory Banking, which is "FUN". Some of the reference material that I regularly use comes from the regulators (such as the EBA). And that's entirely available, no problem, as long as you have a connection. I have saved most regular-use material, like the actual reports and the instructions on how to fill them in, on my laptop (again, this is a public document). But there are some FAQs and Q&As that would add value with a search function, to clarify finer points of regulation.
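
    For the search part, once the FAQ pages are saved as HTML files in a folder, a small script can stand in for the site's search box. This is a rough sketch using only Python's standard library. The class and function names are made up for illustration, and it assumes the saved files end in `.html` and are UTF-8 encoded.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def search_saved_pages(folder, term):
    """Yield (file name, text snippet) for each snippet containing `term`."""
    needle = term.lower()
    for path in sorted(Path(folder).rglob("*.html")):
        parser = _TextExtractor()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        for chunk in parser.chunks:
            if needle in chunk.lower():
                yield path.name, chunk
```

    Looping over `search_saved_pages("saved_faqs", "liquidity")` (folder name hypothetical) would print every matching passage together with the file it came from. The match is a plain case-insensitive substring test, so it is no replacement for a proper index on a large collection.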

    That said, it's something that enriches value, and something I wanted to test if it was viable (more for personal curiosity). There are points, still, in which data is absolutely necessary (when I need to connect to a database, and check some actual data), in those cases it goes through work phone, VPN, and billing as necessary.

    Quote Originally Posted by OracleofWuffing View Post
    Erm... If "Print Screen, Save as" is on the table, why not just "File > Print" and set your destination as "Save as PDF?"

    I know it won't preserve hyperlinks and whatnot, but that will keep it copy-pasta capable.
    I ended up doing something along those lines actually: save it to PDF, and then bind the PDFs together.

  12. - Top - End - #12
    Ogre in the Playground
    Join Date
    Aug 2022

    Default Re: [HTTRACK] Copying a webpage

    Quote Originally Posted by thethird View Post
    What I need available is reference documentation that's public but a pain in the backside to navigate. In particular, I work in Regulatory Banking, which is "FUN". Some of the reference material that I regularly use comes from the regulators (such as the EBA). And that's entirely available, no problem, as long as you have a connection. I have saved most regular-use material, like the actual reports and the instructions on how to fill them in, on my laptop (again, this is a public document). But there are some FAQs and Q&As that would add value with a search function, to clarify finer points of regulation.
    Ok. Fair enough. Didn't want to rain on parades or anything. It's just that as a long time IT engineer at a multi-national corporation, I'm trained to immediately ask questions like the ones above the moment anyone asks something like "how do I copy data from <somewhere> and put it <somewhere else>, so I can access it while traveling <somewhere>". Failure to do so can have serious ramifications. We've had people arrested, jailed, etc in foreign countries (sometimes, some really not nice foreign countries) because they innocently printed out stuff, not thinking it was an issue at all, only to run afoul of some regulations they were not aware of (and sometimes, this can just be an excuse for a shakedown basically). Given the vagueness of the OP, I chose to go the "better safe than sorry" route.


    Quote Originally Posted by thethird View Post
    I ended up doing something along those lines actually: save it to PDF, and then bind the PDFs together.
    Yeah. That should work for most docs you'd need.
